Preface


In graduate school, and for the first few years as an assistant professor, my 
research was in pure mathematics, mainly topology and functional anal- 
ysis. Around 1979 I was drawn, largely by accident, into signal process- 
ing, collaborating with friends at the Naval Research Laboratory who were 
working on sonar. Initially, I felt that the intersection of the mathematics 
that I knew and that they knew was nearly empty. After a while, I began 
to realize that the basic tools of signal processing are subjects with which 
I was already somewhat familiar, including Fourier series, matrices, and 
probability and statistics. Much of the jargon and notation seemed foreign 
to me, and I did not know much about the particular applications everyone 
else was working on. For a while it seemed that everyone else was speaking 
a foreign language. However, my knowledge of the basic mathematical tools 
helped me gradually to understand what was going on and, eventually, to 
make a contribution. 

Signal processing is, in a sense, applied Fourier analysis, applied linear 
algebra, and some probability and statistics. I had studied Fourier series 
and linear algebra as an undergraduate, and had taught linear algebra 
several times. I had picked up some probability and statistics as a professor, 
although I had never had a course in that subject. Now I was beginning to 
see these tools in a new light: Fourier coefficients arise as measured data in
array processing and tomography, eigenvectors and eigenvalues are used to 
locate sonar and radar targets, matrices become images and the singular- 
value decomposition provides data compression. For the first time, I saw 
Fourier series, matrices and probability and statistics used all at once, in the 
analysis of the sampled cross-sensor correlation matrices and the estimation 
of power spectra. 

In my effort to learn signal processing, I consulted a wide variety of 
texts. Each one helped me somewhat, but I found no text that spoke di- 
rectly to people in my situation. The texts I read were either too hard, 
too elementary, or written in what seemed to me to be a foreign language. 
Some texts in signal processing are written by engineers for engineering 
students, and necessarily rely only on those mathematical notions their 
students have encountered previously. In texts such as [116] basic Fourier 
series and transforms are employed, but there is little discussion of matri- 
ces and no mention of probability and statistics, hence no random models. 
I found the book [121] by Papoulis helpful, although most of the examples
deal with issues of interest primarily to electrical engineers. The books
written by mathematicians tend to treat signal processing as a part of 
harmonic analysis or of stochastic processes. Books about Fourier analysis 
focus on its use in partial differential equations, or explore rigorously the 
mathematical aspects of the subject. I was looking for something different. 
It would have helped me a great deal if there had been a book addressed to 
people like me, people with a decent mathematical background who were 
trying to learn signal processing. My hope is that this book serves that 
purpose. 

There are many opportunities for mathematically trained people to 
make a contribution in signal and image processing, and yet few mathemat- 
ics departments offer courses in these subjects to their students, preferring 
to leave it to the engineering departments. One reason, I imagine, is that 
few mathematics professors feel qualified to teach the subject. My message 
here is that they probably already know a good deal of signal processing, 
but do not realize that they know it. This book is designed to help them 
come to that realization and to encourage them to include signal processing 
as a course for their undergraduates. 

The situations of interest that serve to motivate much of what is dis- 
cussed in this book can be summarized as follows: We have obtained data 
through some form of sensing; physical models, often simplified, describe 
how the data we have obtained relates to the information we seek; there 
usually isn’t enough data and what we have is corrupted by noise, mod- 
eling errors, and other distortions. Although applications differ from one 
another in their details, they often make use of a common core of mathe- 
matical ideas. For example, the Fourier transform and its variants play an 
important role in remote sensing, and therefore in many areas of signal and 
image processing, as do the language and theory of matrix analysis, itera- 
tive optimization and approximation techniques, and the basics of proba- 
bility and statistics. This common core provides the subject matter for this 
text. Applications of the core material to tomographic medical imaging, 
optical imaging, and acoustic signal processing are included in this book. 

The term signal processing is used here in a somewhat restrictive sense 
to describe the extraction of information from measured data. I believe 
that to get information out we must put information in. How to use the 
mathematical tools to achieve this is one of the main topics of the book. 

This text is designed to provide a bridge to help those with a solid math- 
ematical background to understand and employ signal processing tech- 
niques in an applied environment. The emphasis is on a small number of 
fundamental problems and essential tools, as well as on applications. Cer- 
tain topics that are commonly included in textbooks are touched on only 
briefly or in exercises or not mentioned at all. Other topics not usually 
considered to be part of signal processing, but which are becoming
increasingly important, such as iterative optimization methods, are included. The
book, then, is a rather personal view of the subject and reflects the author’s 
interests. 

The term signal is not meant to imply a restriction to functions of a 
single variable; indeed, most of what we discuss in this text applies equally 
to functions of one and several variables and therefore to image process- 
ing. However, there are special problems that arise in image processing, 
such as edge detection, and special techniques to deal with such prob- 
lems; we shall not consider such techniques in this text. Topics discussed 
include the following: Fourier series and transforms in one and several vari- 
ables; applications to acoustic and electro-magnetic propagation models, 
transmission and emission tomography, and image reconstruction; sam- 
pling and the limited data problem; matrix methods, singular value de- 
composition, and data compression; optimization techniques in signal and 
image reconstruction from projections; autocorrelations and power spectra; 
high-resolution methods; detection and optimal filtering; eigenvector-based
methods for array processing and statistical filtering; time-frequency
analysis; and wavelets.

The ordering of the first eighteen chapters of the book is not random; 
these main chapters should be read in the order of their appearance. The 
remaining chapters are ordered randomly and are meant to supplement the 
main chapters. 

Reprints of my journal articles referenced here are available in pdf for- 
mat at my website, http://faculty.uml.edu/cbyrne/cbyrne.html. 
Signal Processing: A Mathematical Approach


1.1 Chapter Summary 


We begin with an overview of applications of signal processing and the 
variety of sensing modalities that are employed. It is typical of remote- 
sensing problems that what we want is not what we can measure directly, 
and we must obtain our information by indirect means. To illustrate that 
point without becoming entangled in the details of any particular applica- 
tion, we present a marbles-in-bowls model of remote sensing that, although 
simple, still manages to capture the dominant aspects of many real-world
problems. 


1.2 Aims and Topics 


The term signal processing has broad meaning and covers a wide variety 
of applications. In this course we focus on those applications of signal pro- 
cessing that can loosely be called remote sensing, although the mathematics 
we shall study is fundamental to all areas of signal processing. 

In a course in signal processing it is easy to get lost in the details 
and lose sight of the big picture. My main objectives here are to present 
the most important ideas, techniques, and methods, to describe how they 
relate to one another, and to illustrate their uses in several applications. 
For signal processing, the most important mathematical tools are Fourier 
series and related notions, matrices, and probability and statistics. Most 
students with a solid mathematical background have probably encountered 
each of these topics in previous courses, and therefore already know some 
signal processing, without realizing it. 

Our discussion here will involve primarily functions of a single real vari- 
able, although most of the concepts will have multi-dimensional versions. 
It is not our objective to treat each topic with the utmost mathematical 
rigor, and we shall seek to avoid issues that are primarily of mathematical 
concern. 


1.2.1 The Emphasis in This Book 


This text is designed to provide the necessary mathematical background 
to understand and employ signal processing techniques in an applied en- 
vironment. The emphasis is on a small number of fundamental problems 
and essential tools, as well as on applications. Certain topics that are com- 
monly included in textbooks are touched on only briefly or in exercises or 
not mentioned at all. Other topics not usually considered to be part of
signal processing, but which are becoming increasingly important, such as 
matrix theory and linear algebra, are included. 

The term signal is not meant to imply a specific context or a restriction 
to functions of time, or even to functions of a single variable; indeed, most 
of what we discuss in this text applies equally to functions of one and 
several variables and therefore to image processing. However, this is in no 
sense an introduction to image processing. There are special problems that 
arise in image processing, such as edge detection, and special techniques to 
deal with such problems; we shall not consider such techniques in this text. 


1.2.2 Topics Covered 


Topics discussed in this text include the following: Fourier series and 
transforms in one and several variables; applications to acoustic and EM 
propagation models, transmission and emission tomography, and image re- 
construction; sampling and the limited data problem; matrix methods, sin- 
gular value decomposition, and data compression; optimization techniques 
in signal and image reconstruction from projections; autocorrelations and 
power spectra; high-resolution methods; detection and optimal filtering; 
eigenvector-based methods for array processing and statistical filtering; 
time-frequency analysis; and wavelets. 


1.2.3 Limited Data 


As we shall see, it is often the case that the data we measure is not 
sufficient to provide a single unique answer to our problem. There may 
be many, often quite different, answers that are consistent with what we 
have measured. In the absence of prior information about what the answer 
should look like, we do not know how to select one solution from the many 
possibilities. For that reason, I believe that to get information out we must 
put information in. How to do this is one of the main topics of the course. 
The example at the end of this chapter will illustrate this point. 


1.3 Examples and Modalities 


There are a wide variety of problems in which what we want to know 
about is not directly available to us and we need to obtain information 
by more indirect methods. In this section we present several examples of 
remote sensing. The term “modality” refers to the manner in which the 
desired information is obtained. Although the sensing of acoustic and
electromagnetic signals is perhaps the most commonly used method, remote
sensing involves a wide variety of modalities: electromagnetic waves (light, 
x-ray, microwave, radio); sound (sonar, ultrasound); radioactivity (positron 
and single-photon emission); magnetic resonance (MRI); seismic waves; and 
a number of others. 


1.3.1 X-ray Crystallography 


The patterns produced by the scattering of x-rays passing through var- 
ious materials can be used to reveal their molecular structure. 


1.3.2 Transmission Tomography 


In transmission tomography x-rays are transmitted along line segments 
through the object and the drop in intensity along each line is recorded. 


1.3.3 Emission Tomography 


In emission tomography radioactive material is injected into the body 
of the living subject and the photons resulting from the radioactive decay 
are detected and recorded outside the body. 


1.3.4 Back-Scatter Detectors 


There is considerable debate at the moment about the use of so-called 
full-body scanners at airports. These are not scanners in the sense of a 
CAT scan; indeed, if the images were skeletons there would probably be 
less controversy. These are images created by the returns, or backscatter, of 
millimeter-wavelength (MMW) radio-frequency waves, or sometimes low- 
energy x-rays, that penetrate only the clothing and then reflect back to the 
machine. 

The controversies are not really about safety to the passenger being 
imaged. The MMW imaging devices use about 10,000 times less energy 
than a cell phone, and the x-ray exposure is equivalent to two minutes 
of flying in an airplane. At present, the images are fuzzy and faces are 
intentionally blurred, but there is some concern that the images will get 
sharper, will be permanently stored, and eventually end up on the net. 
Given what is already available on the net, the market for these images 
will almost certainly be non-existent. 
1.3.5 Cosmic-Ray Tomography


Because of their ability to penetrate granite, cosmic rays are being used 
to obtain transmission-tomographic three-dimensional images of the inte- 
riors of active volcanos. Where magma has replaced granite there is less 
attenuation of the rays, so the image can reveal the size and shape of the 
magma column. It is hoped that this will help to predict the size and oc- 
currence of eruptions. 

In addition to mapping the interior of volcanos, cosmic rays can also be 
used to detect the presence of shielding around nuclear material in a cargo 
container. The shielding can be sensed by the characteristic scattering by 
it of muons from cosmic rays; here neither we nor the objects of interest 
are the sources of the probing. This is about as “remote” as sensing can 
be. 


1.3.6 Ocean-Acoustic Tomography 


The speed of sound in the ocean varies with the temperature, among 
other things. By transmitting sound from known locations to known re- 
ceivers and measuring the travel times we can obtain line integrals of the 
temperature function. Using the reconstruction methods from transmission 
tomography, we can estimate the temperature function. Knowledge of the 
temperature distribution may then be used to improve detection of sources 
of acoustic energy in unknown locations. 


1.3.7 Spectral Analysis 


In our detailed discussion of transmission and remote sensing we shall, 
for simplicity, concentrate on signals consisting of a single frequency. Never- 
theless, there are many important applications of signal processing in which 
the signal being studied has a broad spectrum, indicative of the presence 
of many different frequencies. The purpose of the processing is often to 
determine which frequencies are present, or not present, and to determine 
their relative strengths. The hotter inner body of the sun emits radiation 
consisting of a continuum of frequencies. The cooler outer layer absorbs 
the radiation whose frequencies correspond to the elements present in that 
outer layer. Processing these signals reveals a spectrum with a number of 
missing frequencies, the so-called Fraunhofer lines, and provides informa- 
tion about the makeup of the sun’s outer layers. This sort of spectral anal- 
ysis can be used to identify the components of different materials, making 
it an important tool in many applications, from astronomy to forensics. 
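
As a small illustration of determining which frequencies are present and what their relative strengths are, the following Python sketch builds a synthetic signal from two sinusoids and reads the frequencies off its discrete Fourier transform. The sampling rate, the component frequencies, the threshold, and the helper name dominant_frequencies are all assumptions chosen for illustration; the frequencies are picked to fall exactly on DFT bins, which measured data generally would not do.

```python
import numpy as np


def dominant_frequencies(signal, sample_rate, threshold=0.1):
    """Return (frequency, amplitude) pairs whose amplitude exceeds threshold.

    Hypothetical helper: amplitudes are estimated from the one-sided DFT,
    which is exact only when each sinusoid falls on a DFT bin.
    """
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    # Scale so that a sinusoid of amplitude a produces a peak of height a
    # (the zero-frequency and Nyquist bins would need a factor of 1, not 2).
    amplitudes = 2.0 * np.abs(spectrum) / n
    keep = amplitudes > threshold
    return [(float(f), float(a)) for f, a in zip(freqs[keep], amplitudes[keep])]


sample_rate = 256                       # samples per second (assumed)
t = np.arange(256) / sample_rate        # one second of data
signal = 1.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

found = dominant_frequencies(signal, sample_rate)
for freq, amp in found:
    print(f"{freq:6.1f} Hz  amplitude {amp:.3f}")
```

A missing frequency, as in the Fraunhofer lines, would show up in the same way as a bin whose amplitude falls below the threshold while its neighbors do not.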
1.3.8 Seismic Exploration


Oil companies want to know if it is worth their while drilling in a partic- 
ular place. If they go ahead and drill, they will find out, but they would like 
to know what is the chance of finding oil without actually drilling. Instead, 
they set off explosions and analyze the signals produced by the seismic 
waves, which will tell them something about the materials the waves en- 
countered. Explosive charges create waves that travel through the ground 
and are picked up by sensors. The waves travel at different speeds through 
different materials. Information about the location of different materials in 
the ground is then extracted from the received signals. 


1.3.9 Astronomy 


Astronomers know that there are radio waves, visible-light waves, and 
other forms of electro-magnetic radiation coming from the sun and distant 
regions of space, and they would like to know precisely what is coming 
from which regions. They cannot go there to find out, so they set up large 
telescopes and antenna arrays and process the signals that they are able to 
measure. 


1.3.10 Radar 


Those who predict the weather use radar to help them see what is going 
on in the atmosphere. Radio waves are sent out and the returns are analyzed 
and turned into images. The location of airplanes is also determined by 
radar. The radar returns from different materials are different from one 
another and can be analyzed to determine what materials are present. 
Synthetic-aperture radar is used to obtain high-resolution images of regions 
of the earth’s surface. The radar returns from different geometric shapes 
also differ in strength; by avoiding right angles in airplane design, stealth
technology attempts to make the plane invisible to radar.


1.3.11 Sonar 


Features on the bottom of the ocean are imaged with sonar, in which 
sound waves are sent down to the bottom and the returning waves are 
analyzed. Sometimes near or distant objects of interest in the ocean emit 
their own sound, which is measured by sensors. The signals received by the 
sensors are processed to determine the nature and location of the objects. 
Even changes in the temperature at different places in the ocean can be 
determined by sending sound waves through the region of interest and 
measuring the travel times. 
1.3.12 Gravity Maps


The pull of gravity varies with the density of the material. Features on 
the surface of the earth, such as craters from ancient asteroid impacts, can 
be imaged by mapping the variations in the pull of gravity, as measured by 
satellites. 

Gravity, or better, changes in the pull of gravity from one location to 
another, was used in the discovery of the crater left behind by the asteroid 
strike in the Yucatan that led to the extinction of the dinosaurs. The rocks 
and other debris that eventually filled the crater differ in density from 
the surrounding material, thereby exerting a slightly different gravitational 
pull on other masses. This slight change in pull can be detected by sensitive 
instruments placed in satellites in earth orbit. When the intensity of the 
pull, as a function of position on the earth’s surface, is displayed as a two- 
dimensional image, the presence of the crater is evident. 

Studies of the changes in gravitational pull of the Antarctic ice between 
2002 and 2005 revealed that Antarctica is losing 36 cubic miles of ice each 
year. By way of comparison, the city of Los Angeles uses one cubic mile of 
water each year. While this finding is often cited as clear evidence of global 
warming, it contradicts some models of climate change that indicate that 
global warming may lead to an increase of snowfall, and therefore more ice, 
in the polar regions. This does not show that global warming is not taking 
place, but only the inadequacies of some models [119]. 


1.3.13 Echo Cancellation 


In a conference call between locations A and B, what is transmitted 
from A to B can get picked up by microphones in B, transmitted back 
to speakers in A and then retransmitted to B, producing an echo of the 
original transmission. Signal processing performed at the transmitter in 
A can reduce the strength of the second version of the transmission and 
decrease the echo effect. 


1.3.14 Hearing Aids 


Makers of digital hearing aids include signal processing to enhance the 
quality of the received sounds, as well as to improve localization, that is, 
the ability of the hearer to tell where the sound is coming from. When a 
hearing aid is used, sounds reach the ear in two ways: first, the usual route 
directly into the ear, and second, through the hearing aid. Because that part 
that passes through the hearing aid is processed, there is a slight delay. In 
order for the delay to go unnoticed, the processing must be very fast. When 
hearing aids are used in both ears, more sophisticated processing can be 
used. 
1.3.15 Near-Earth Asteroids


An area of growing importance is the search for potentially damaging 
near-earth asteroids. These objects are initially detected by passive op- 
tical observation, as small dots of reflected sunlight; once detected, they 
are then imaged by active radar to determine their size, shape, rotation, 
path, and other important parameters. Satellite-based infrared detectors 
are being developed to find dark asteroids by the heat they give off. Such 
satellites, placed in orbit between the sun and the earth, will be able to 
detect asteroids hidden from earth-based telescopes by the sunlight. 


1.3.16 Mapping the Ozone Layer 


Ultraviolet light from the sun is scattered by ozone. By measuring the 
amount of scattered UV at various locations on the earth’s surface, and with 
the sun in various positions, we obtain values of the Laplace transform of 
the function describing the density of ozone, as a function of elevation. 


1.3.17 Ultrasound Imaging 


While x-ray tomography is a powerful method for producing images 
of the interior of patients’ bodies, the radiation involved and the expense 
make it unsuitable in some cases. Ultrasound imaging, making use of back- 
scattered sound waves, is a popular method of inexpensive preliminary 
screening for medical diagnostics, and for examining a developing fetus. 


1.3.18 X-ray Vision? 


The MIT computer scientist and electrical engineer Dina Katabi and 
her students are currently exploring new uses of wireless technologies. By 
combining Wi-Fi and vision into what she calls Wi-Vi, she has discovered
a way to detect the number and approximate location of persons within a 
closed room and to recognize simple gestures. The low-bandwidth wireless
signals that scatter back through the walls are processed to remove the
reflections from motionless sources, leaving the much weaker reflections
from moving objects, presumably people.


1.4 The Common Core 


The examples just presented look quite different from one another, but 
the differences are often more superficial than real. As we begin to use 
mathematics to model these various situations we often discover a common
core of mathematical tools and ideas at the heart of each of these applica- 
tions. For example, the Fourier transform and its variants play an impor- 
tant role in many areas of signal and image processing, as do the language 
and theory of matrix analysis, iterative optimization and approximation 
techniques, and the basics of probability and statistics. This common core 
provides the subject matter for this book. Applications of the core mate- 
rial to tomographic medical imaging, optical imaging, and acoustic signal 
processing are among the topics to be discussed in some detail. 

Although the applications of interest to us vary in their details, they 
have common aspects that can be summarized as follows: the data has been 
obtained through some form of sensing; physical models, often simplified, 
describe how the data we have obtained relates to the information we seek; 
there usually isn’t enough data and what we have is corrupted by noise 
and other distortions. 


1.5 Active and Passive Sensing 


In some signal and image processing applications the sensing is ac- 
tive, meaning that we have initiated the process, by, say, sending an x-ray 
through the body of a patient, injecting a patient with a radionuclide, trans- 
mitting an acoustic signal through the ocean, as in sonar, or transmitting 
a radio wave, as in radar. In such cases, we are interested in measuring 
how the system, the patient, the quiet submarine, the ocean floor, the rain 
cloud, will respond to our probing. In many other applications, the sens- 
ing is passive, which means that the object of interest to us provides its 
own signal of some sort, which we then detect, analyze, image, or process 
in some way. Certain sonar systems operate passively, listening for sounds 
made by the object of interest. Optical and radio telescopes are passive, 
relying on the object of interest to emit or reflect light, or other electromag- 
netic radiation. Night-vision instruments are sensitive to lower-frequency, 
infrared radiation. 

From the time of Aristotle and Euclid until the Middle Ages there was an
ongoing debate concerning the active or passive nature of human sight [112]. 
Those like Euclid, whose interests were largely mathematical, believed that 
the eye emitted rays, the extramission theory. Aristotle and others, more 
interested in the physiology and anatomy of the eye than in mathematics, 
believed that the eye received rays from observed objects outside the body, 
the intromission theory. Finally, around 1000 AD, the Arabic mathemati- 
cian and natural philosopher Alhazen demolished the extramission theory 
by noting the potential for bright light to hurt the eye, and combined the
mathematics of the extramission theorists with a refined theory of intro- 
mission. The extramission theory has not gone away completely, however, 
as anyone familiar with Superman’s x-ray vision knows. 


1.6 Using Prior Knowledge 


An important point to keep in mind when doing signal processing is 
that, while the data is usually limited, the information we seek may not be 
lost. Although processing the data in a reasonable way may suggest other- 
wise, other processing methods may reveal that the desired information is 
still available in the data. Figure 1.1 illustrates this point. 

The original image on the upper right of Figure 1.1 is a discrete rect- 
angular array of intensity values simulating the distribution of the x-ray- 
attenuating material in a slice of a head. The data was obtained by taking 
the two-dimensional discrete Fourier transform of the original image, and 
then discarding, that is, setting to zero, all these spatial frequency values, 
except for those in a smaller rectangular region around the origin. Recon- 
structing the image from this limited data amounts to solving a large system 
of linear equations. The problem is under-determined, so a minimum-norm 
solution would seem to be a reasonable reconstruction method. For now, 
“norm” means the Euclidean norm. 

The minimum-norm solution is shown on the lower right. It is calcu- 
lated simply by performing an inverse discrete Fourier transform on the 
array of modified discrete Fourier transform values. The original image has 
relatively large values where the skull is located, but the minimum-norm
reconstruction does not want such high values; the norm involves the sum
of squares of intensities, and high values contribute disproportionately to 
the norm. Consequently, the minimum-norm reconstruction chooses instead 
to conform to the measured data by spreading what should be the skull 
intensities throughout the interior of the skull. The minimum-norm recon- 
struction does tell us something about the original; it tells us about the 
existence of the skull itself, which, of course, is indeed a prominent feature 
of the original. However, in all likelihood, we would already know about 
the skull; it would be the interior that we want to know about. 
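
The statement that the minimum-norm solution is obtained by inverse transforming the zero-filled data can be checked numerically. The following Python sketch does this for a one-dimensional stand-in for the image. The array size, the number of retained frequencies, the "skull" and "interior" values, and all variable names are assumptions made for illustration; this is not the data behind Figure 1.1.

```python
import numpy as np

N = 64                      # number of pixels in the 1-D "image" (assumed)
K = 4                       # keep frequencies -K..K around the origin (assumed)

# Toy object: a high-valued "skull" on both sides of a low-valued interior.
truth = np.zeros(N)
truth[10:13] = 10.0
truth[51:54] = 10.0
truth[13:51] = 1.0
truth[30:34] = 2.0          # small interior feature

# Limited data: only the low-frequency DFT values are retained.
kept = np.r_[0:K + 1, N - K:N]
F = np.fft.fft(np.eye(N))   # full DFT matrix, so F @ x equals np.fft.fft(x)
A = F[kept]                 # under-determined linear system A @ x = b
b = A @ truth

# Reconstruction 1: set the discarded frequencies to zero, then inverse DFT.
zero_filled = np.zeros(N, dtype=complex)
zero_filled[kept] = b
x_ifft = np.fft.ifft(zero_filled)

# Reconstruction 2: minimum-Euclidean-norm solution via the pseudo-inverse.
x_minnorm = np.linalg.pinv(A) @ b

print("max difference between the two:", np.max(np.abs(x_ifft - x_minnorm)))
print("norm of the toy object:        ", np.linalg.norm(truth))
print("norm of the min-norm solution: ", np.linalg.norm(x_minnorm))
```

Both reconstructions reproduce the retained data exactly, and the minimum-norm one can never have a larger norm than the object itself, which is why it has no incentive to place large values at the skull.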

Using our knowledge of the presence of a skull, which we might have ob- 
tained from the minimum-norm reconstruction itself, we construct the prior 
estimate shown in the upper left. Now we use the same data as before, and 
calculate a minimum-weighted-norm reconstruction, using as the weight 
vector the reciprocals of the values of the prior image. This minimum-weighted-norm
reconstruction, also called the PDFT estimator, is shown
on the lower left; it is clearly almost the same as the original image. The
calculation of the minimum-weighted-norm solution can be done iteratively
using the ART algorithm [143].

FIGURE 1.1: Extracting information in image reconstruction.

When we weight the skull area with the inverse of the prior image, 
we allow the reconstruction to place higher values there without having 
much of an effect on the overall weighted norm. In addition, the reciprocal 
weighting in the interior makes spreading intensity into that region costly, 
so the interior remains relatively clear, allowing us to see what is really 
present there. 
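
The effect of the reciprocal weighting can be explored on a one-dimensional toy problem. The Python sketch below minimizes the weighted norm subject to the data constraints using the standard closed-form solution of that quadratic problem, solved directly rather than by the ART iteration. The sizes, the toy object, the prior values, and the function name min_weighted_norm are all assumptions for illustration; this is not the computation that produced Figure 1.1.

```python
import numpy as np

N, K = 64, 4                           # sizes are assumptions for this toy case

truth = np.zeros(N)
truth[10:13] = truth[51:54] = 10.0     # "skull"
truth[13:51] = 1.0                     # interior
truth[30:34] = 2.0                     # interior feature we hope to see

kept = np.r_[0:K + 1, N - K:N]         # low frequencies only
A = np.fft.fft(np.eye(N))[kept]        # rows of the DFT matrix that we measure
b = A @ truth

# Prior estimate: large where we believe the skull is, small elsewhere.
prior = np.full(N, 0.1)
prior[10:13] = prior[51:54] = 10.0
weights = 1.0 / prior                  # weight vector = reciprocals of the prior


def min_weighted_norm(A, b, prior):
    """Minimize sum_n |x_n|^2 / prior_n subject to A @ x = b.

    Closed form: x = P A^H (A P A^H)^{-1} b, with P = diag(prior).
    """
    PAh = prior[:, None] * A.conj().T
    return PAh @ np.linalg.solve(A @ PAh, b)


x_weighted = min_weighted_norm(A, b, prior).real
x_minnorm = min_weighted_norm(A, b, np.ones(N)).real   # unweighted special case


def weighted_norm_sq(x):
    return float(np.sum(weights * x**2))


for name, x in [("minimum-norm", x_minnorm), ("weighted", x_weighted)]:
    print(f"{name:13s} norm^2 = {np.sum(x**2):8.2f}  "
          f"weighted norm^2 = {weighted_norm_sq(x):8.2f}  "
          f"error = {np.linalg.norm(x - truth):6.3f}")
```

Each reconstruction is optimal for its own norm and both match the data. How close the weighted one comes to the object depends on how well the prior describes it, and the printed errors let the reader check that for this toy case.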

When we try to reconstruct an image from limited data, it is easy to 
assume that the information we seek has been lost, particularly when a 
reasonable reconstruction method fails to reveal what we want to know. As 
this example, and many others, show, the information we seek is often still
in the data, but needs to be brought out in a more subtle way. 


1.7 An Urn Model of Remote Sensing 


Most of the signal processing that we shall discuss in this book is re- 
lated to the problem of remote sensing, which we might also call indirect 
measurement. In such problems we do not have direct access to what we are 
really interested in, and must be content to measure something else that is 
related to, but not the same as, what interests us. For example, we want 
to know what is in the suitcases of airline passengers, but, for practical 
reasons, we cannot open every suitcase. Instead, we x-ray the suitcases. A 
recent paper [137] describes progress in detecting nuclear material in cargo 
containers by measuring the scattering, by the shielding, of cosmic rays; 
you can’t get much more remote than that. Before we get into the mathe- 
matics of signal processing, it is probably a good idea to consider a model 
that, although quite simple, manages to capture many of the important 
features of remote-sensing applications. To convince the reader that this is 
indeed a useful model, we relate it to the problem of image reconstruction 
in single-photon emission computed tomography (SPECT). There seems to 
be a tradition in physics of using simple models or examples involving 
urns and marbles to illustrate important principles. In keeping with that 
tradition, we have here two examples, both involving urns of marbles, to 
illustrate various aspects of remote sensing. 


1.7.1 An Urn Model 


Suppose that there is a box containing a large number of small pieces 
of paper, and on each piece is written one of the numbers from j = 1 
to j = J. I want to determine, for each j = 1,..., J, the probability of 
selecting a piece of paper with the number j written on it. Unfortunately, 
I am not allowed to examine the box. I am allowed, however, to set up a 
remote-sensing experiment to help solve my problem. 

My assistant sets up J urns, numbered j = 1,..., J, each containing marbles 
of various colors. Suppose that there are I colors, numbered i = 1,..., I. 
I am allowed to examine each urn, so I know precisely the probability that 
a marble of color i will be drawn from urn j. Out of my view, my assistant 
removes one piece of paper from the box, takes one marble from the 
indicated urn, announces to me the color of the marble, and then replaces 
both the piece of paper and the marble. This action is repeated N times, 
at the end of which I have a long list of colors, i = $\{i_1, i_2, \ldots, i_N\}$, where 
$i_n$ denotes the color of the nth marble drawn. This list i is my data, from 
which I must determine the contents of the box. 

This is a form of remote sensing; what we have access to is related to, 
but not equal to, what we are interested in. What I wish I had is the list of 
urns used, j = $\{j_1, j_2, \ldots, j_N\}$; instead I have i, the list of colors. Sometimes 
data such as the list of colors is called “incomplete data,” in contrast to 
the “complete data,” which would be the list j of the actual urn numbers 
drawn from the box. 

Using our urn model, we can begin to get a feel for the resolution prob- 
lem. If all the marbles of one color are in a single urn, all the black marbles 
in urn j = 1, all the green in urn j = 2, and so on, the problem is trivial; 
when I hear a color, I know immediately which urn contained that marble. 
My list of colors is then a list of urn numbers; i = j. I have the complete 
data now. My estimate of the proportion of pieces of paper containing the 
urn number j is then simply the proportion of draws that resulted in urn 
j being selected. 

At the other extreme, suppose two urns have identical contents. Then I 
cannot distinguish one urn from the other and I am unable to estimate more 
than the total proportion of pieces of paper containing either of the two urn 
numbers. If the two urns have nearly the same contents, we can distinguish 
them only by using a very large N. This is the resolution problem. 

Generally, the more the contents of the urns differ, the easier the task 
of estimating the contents of the box. In remote-sensing applications, these 
issues affect our ability to resolve individual components contributing to 
the data. 


1.7.2 Some Mathematical Notation 


To introduce some mathematical notation, let us denote by $x_j$ the proportion 
of the pieces of paper that have the number j written on them. Let 
$P_{ij}$ be the proportion of the marbles in urn j that have the color i. Let $y_i$ be 
the proportion of times the color i occurs in the list of colors. The expected 
proportion of times i occurs in the list is $E(y_i) = \sum_{j=1}^{J} P_{ij} x_j = (Px)_i$, 
where P is the I by J matrix with entries $P_{ij}$ and x is the J by 1 column 
vector with entries $x_j$. A reasonable way to estimate x is to replace $E(y_i)$ 
with the actual $y_i$ and solve the system of linear equations $y_i = \sum_{j=1}^{J} P_{ij} x_j$, 
i = 1,..., I. Of course, we require that the $x_j$ be nonnegative and sum to 
one, so special algorithms may be needed to find such solutions. In a number 
of applications that fit this model, such as medical tomography, the 
values $x_j$ are taken to be parameters, the data $y_i$ are statistics, and the $x_j$ 
are estimated by adopting a probabilistic model and maximizing the likelihood 
function. Iterative algorithms, such as the expectation maximization 
maximum likelihood (EMML) algorithm, are often used for such problems; 
see Chapter 14 for details. 
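
To make the estimation problem concrete, here is a minimal Python sketch that simulates the urn experiment and then estimates x from the observed color proportions. It uses the EMML iteration in its commonly stated form, $x_j \leftarrow x_j \sum_i P_{ij} y_i / (Px)_i$, which assumes that each column of P sums to one. The matrix P, the vector x_true, the number of draws, and the function names are illustrative assumptions, not values from the text; Chapter 14 gives the actual treatment of the algorithm.

```python
import random

def simulate_colors(x, P, N, rng):
    """Return N announced colors: urn j is chosen with probability x[j],
    then color i is drawn from that urn with probability P[i][j]."""
    I, J = len(P), len(x)
    colors = []
    for _ in range(N):
        j = rng.choices(range(J), weights=x)[0]
        i = rng.choices(range(I), weights=[P[r][j] for r in range(I)])[0]
        colors.append(i)
    return colors

def emml(y, P, iterations=500):
    """EMML update x_j <- x_j * sum_i P[i][j] * y[i] / (Px)[i].
    Assumes every column of P sums to one and y sums to one."""
    I, J = len(P), len(P[0])
    x = [1.0 / J] * J  # start from the uniform vector
    for _ in range(iterations):
        Px = [sum(P[i][j] * x[j] for j in range(J)) for i in range(I)]
        x = [x[j] * sum(P[i][j] * y[i] / Px[i] for i in range(I))
             for j in range(J)]
    return x

# Hypothetical example: I = 4 colors, J = 3 urns; column j of P describes urn j.
P = [[0.7, 0.1, 0.2],
     [0.1, 0.6, 0.1],
     [0.1, 0.2, 0.6],
     [0.1, 0.1, 0.1]]
x_true = [0.5, 0.3, 0.2]
N = 20000
colors = simulate_colors(x_true, P, N, random.Random(0))
y = [colors.count(i) / N for i in range(len(P))]  # observed color proportions
x_hat = emml(y, P)
print([round(v, 3) for v in x_hat])
```

Because the three urns in this example have clearly different contents, the estimate should land close to x_true; making two columns of P nearly equal is an easy way to watch the resolution problem appear.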


1.7.3 An Application to SPECT Imaging 


In single-photon emission computed tomography (SPECT) the patient 
is injected with a chemical to which a radioactive tracer has been attached. 
Once the chemical reaches its destination within the body the photons 
emitted by the radioactive tracer are detected by gamma cameras outside 
the body. The objective is to use the information from the detected photons 
to infer the relative concentrations of the radioactivity within the patient. 

We discretize the problem and assume that the body of the patient 
consists of J small volume elements, called voxels, analogous to pixels in 
digitized images. We let $x_j \geq 0$ be the unknown proportion of the radioactivity 
that is present in the jth voxel, for j = 1,..., J. There are I detectors, 
denoted {i = 1,2,..., I}. For each i and j we let $P_{ij}$ be the known probability 
that a photon that is emitted from voxel j is detected at detector 
i; these probabilities are usually determined by examining the relative positions 
in space of voxel j and detector i. We denote by $i_n$ the detector 
at which the nth emitted photon is detected. This photon was emitted at 
some voxel, denoted $j_n$; we wish that we had some way of learning what 
each $j_n$ is, but we must be content with knowing only the $i_n$. After N 
photons have been emitted, we have as our data the list i = $\{i_1, i_2, \ldots, i_N\}$; 
this is our incomplete data. We wish we had the complete data, that is, the 
list j = $\{j_1, j_2, \ldots, j_N\}$, but we do not. Our goal is to estimate the frequency 
with which each voxel emitted a photon, which we assume, reasonably, to 
be proportional to the unknown proportions $x_j$, for j = 1,..., J. 

This problem is completely analogous to the urn problem previously 
discussed. Any mathematical method that solves one of these problems 
will solve the other one. In the urn problem, the colors were announced; 
here the detector numbers are announced. There, I wanted to know the 
urn numbers; here I want to know the voxel numbers. There, I wanted to 
estimate the frequency with which the jth urn was used; here, I want to 
estimate the frequency with which the jth voxel is the site of an emission, 
which is assumed to be equal to the proportion of the radionuclide within 
the jth voxel. In the urn model, two urns with nearly the same contents are 
hard to distinguish unless N is very large; here, two neighboring voxels will 
be very hard to distinguish (i.e., to resolve) unless N is very large. But in 
the SPECT case, a large N means a high dosage, which will be prohibited 
by safety considerations. Therefore, we have a built-in resolution problem 
in the SPECT case. 

Both problems are examples of probabilistic mixtures, in which the mixing 
probabilities are the $x_j$ that we seek. The maximum likelihood (ML) 
method of statistical parameter estimation can be used to solve such problems. 
The interested reader should consult the text [42]. 


1.8 Hidden Markov Models 


In the urn model we just discussed, the order of the colors in the list is 
unimportant; we could randomly rearrange the colors on the list without 
affecting the nature of the problem. The probability that a green marble 
will be chosen next is the same, whether a blue or a red marble was just 
chosen the previous time. This independence from one selection to another 
is fine for modeling certain physical situations, such as emission tomogra- 
phy. However, there are other situations in which this independence does 
not conform to reality. 

In written English, for example, knowing the current letter helps us, 
sometimes more, sometimes less, to predict what the next letter will be. 
We know that, if the current letter is a “q”, then there is a high probability 
that the next one will be a “u”. So what the current letter is affects the 
probabilities associated with the selection of the next one. 

Spoken English is even tougher. There are many examples in which 
the pronunciation of a certain sound is affected, not only by the sound or 
sounds that preceded it, but by the sound or sounds that will follow. For 
example, the sound of the “e” in the word “bellow” is different from the 
sound of the “e” in the word “below”; the sound changes, depending on 
whether there is a double “l” or a single “l” following the “e”. Here the 
entire context of the letter affects its sound. 

Hidden Markov models (HMM) are increasingly important in speech 
processing, optical character recognition, and DNA sequence analysis. They 
allow us to incorporate dependence on the context into our model. In this 
section we illustrate HMM using a modification of the urn model. 

Suppose, once again, that we have J urns, indexed by j = 1,..., J and 
I colors of marbles, indexed by i = 1,..., I. Associated with each of the 
J urns is a box, containing a large number of pieces of paper, with the 
number of one urn written on each piece. My assistant selects one box, say 
the $j_0$th box, to start the experiment. He draws a piece of paper from that 
box, reads the number written on it, call it $j_1$, goes to the urn with the 
number $j_1$ and draws out a marble. He then announces the color. He then 
draws a piece of paper from box number $j_1$, reads the next number, say 
$j_2$, proceeds to urn number $j_2$, etc. After N marbles have been drawn, the 
only data I have is a list of colors, i = $\{i_1, i_2, \ldots, i_N\}$. 

The transition probability that my assistant will proceed from the urn 
numbered k to the urn numbered j is $b_{jk}$, with $\sum_{j=1}^{J} b_{jk} = 1$. The number 
of the current urn is the current state. In an ordinary Markov chain 
model, we observe directly a sequence of states governed by the transition 
probabilities. The Markov chain model provides a simple formalism for de- 
scribing a system that moves from one state into another, as time goes on. 
In the hidden Markov model we are not able to observe the states directly; 
they are hidden from us. Instead, we have indirect observations, the colors 
of the marbles in our urn example. 

The probability that the color numbered i will be drawn from the urn 
numbered j is $a_{ij}$, with $\sum_{i=1}^{I} a_{ij} = 1$, for all j. The colors announced 
are the visible states, while the unannounced urn numbers are the hidden 
states. 

There are several distinct objectives one can have, when using HMM. 
We assume that the data is the list of colors, i. 


• Evaluation: For given probabilities $a_{ij}$ and $b_{jk}$, what is the probability 
that the list i was generated according to the HMM? Here, the 
objective is to see if the model is a good description of the data. 


• Decoding: Given the model, the probabilities, and the list i, what 
list j = $\{j_1, j_2, \ldots, j_N\}$ of urns is most likely to be the list of urns 
actually visited? Now, we want to infer the hidden states from the 
visible ones. 


• Learning: We are told that there are J urns and I colors, but are not 
told the probabilities $a_{ij}$ and $b_{jk}$. We are given several data vectors i 
generated by the HMM; these are the training sets. The objective is 
to learn the probabilities. 


Once again, the ML approach can play a role in solving these problems [68]. 
The Viterbi algorithm is an important tool used for the decoding phase (see 
[149]). 
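
To illustrate the decoding objective, the following Python sketch implements the Viterbi recursion for the urn version of the HMM, using the notation above: a[i][j] is the probability of color i from urn j, b[j][k] is the probability of moving from urn k to urn j, and j0 is the starting box. The numerical values of a, b, j0, and the list of colors are hypothetical, and the function name is our own; this is a sketch under those assumptions rather than the treatment given in [149].

```python
import math

def viterbi(colors, a, b, j0):
    """Return the most likely urn sequence j_1, ..., j_N for the observed colors.
    a[i][j]: probability of color i from urn j (columns sum to one).
    b[j][k]: probability of moving from urn k to urn j (columns sum to one).
    j0: index of the box used for the first draw."""
    J = len(b)

    def safe_log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[j] = best log-probability of any urn path that ends in urn j
    score = [safe_log(b[j][j0]) + safe_log(a[colors[0]][j]) for j in range(J)]
    back = []  # back[n][j] = best predecessor of urn j at step n + 1
    for i in colors[1:]:
        prev = score
        pointers, score = [], []
        for j in range(J):
            k_best = max(range(J), key=lambda k: prev[k] + safe_log(b[j][k]))
            pointers.append(k_best)
            score.append(prev[k_best] + safe_log(b[j][k_best]) + safe_log(a[i][j]))
        back.append(pointers)

    # Trace the best path backwards from the best final urn.
    j = max(range(J), key=lambda u: score[u])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path

# Hypothetical example with J = 2 urns and I = 2 colors (indices start at 0).
a = [[0.9, 0.2],
     [0.1, 0.8]]
b = [[0.8, 0.3],
     [0.2, 0.7]]
j0 = 0
colors = [0, 0, 1, 1, 1, 0]
path = viterbi(colors, a, b, j0)
print(path)
```

Working with logarithms avoids numerical underflow when N is large; the recursion costs on the order of N J^2 operations, in contrast to the J^N paths an exhaustive search would examine.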


2.1 Chapter Summary 


We begin with Fourier series and Fourier transforms, which are essential 
tools in signal processing. In this chapter we give the formulas for 
Fourier series and Fourier transforms, in both trigonometric and complex-exponential 
form, summarize their basic properties, and give several examples 
of Fourier-transform pairs. We connect Fourier series to Fourier 
transforms using Shannon’s Sampling Theorem. We solve a heat-equation 
problem to illustrate the use of Fourier series while introducing fundamental 
aspects of inverse problems. We leave to Chapter 26 the more theoretical 
details regarding Fourier series and Fourier transforms. 


2.2 Fourier Series 


Most mathematics students see Fourier series for the first time in a 
course on boundary-value problems. There students usually study the wave 
equation and the heat equation in two dimensions, using the technique of 
separating the space and time variables. Fourier series and Fourier trans- 
forms arise as we attempt to satisfy the initial conditions using a superpo- 
sition of sine and cosine functions. 

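Suppose, for concreteness, that we have a function $f : [-L, L] \to \mathbb{R}$ and 
we want to express this function as a Fourier series. The Fourier series for 
f, relative to the interval [-L, L], is 


$$f(x) \approx \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{n\pi}{L} x\right) + b_n \sin\left(\frac{n\pi}{L} x\right) \right], \qquad (2.1)$$

where the Fourier coefficients $a_n$ and $b_n$ are 


$$a_n = \frac{1}{L} \int_{-L}^{L} f(x) \cos\left(\frac{n\pi}{L} x\right) dx, \qquad (2.2)$$


and 


$$b_n = \frac{1}{L} \int_{-L}^{L} f(x) \sin\left(\frac{n\pi}{L} x\right) dx. \qquad (2.3)$$


To obtain the formula for, say, $a_m$, the usual approach is to write 


$$f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{n\pi}{L} x\right) + b_n \sin\left(\frac{n\pi}{L} x\right) \right], \qquad (2.4)$$


for |x| < L, multiply both sides of Equation (2.4) by $\cos\left(\frac{m\pi}{L} x\right)$, and then 
integrate both sides, integrating term-by-term inside the sum on the right 
side of the equation. Orthogonality then gives the desired answer, since we 
have 

$$\int_{-L}^{L} \cos\left(\frac{m\pi}{L} x\right) \sin\left(\frac{n\pi}{L} x\right) dx = 0,$$


$$\int_{-L}^{L} \cos\left(\frac{m\pi}{L} x\right) \cos\left(\frac{m\pi}{L} x\right) dx = L,$$


and 


$$\int_{-L}^{L} \sin\left(\frac{m\pi}{L} x\right) \sin\left(\frac{m\pi}{L} x\right) dx = L,$$


for all positive integers m and n, and, for m ≠ n, 


$$\int_{-L}^{L} \cos\left(\frac{m\pi}{L} x\right) \cos\left(\frac{n\pi}{L} x\right) dx = 0,$$


and 


$$\int_{-L}^{L} \sin\left(\frac{m\pi}{L} x\right) \sin\left(\frac{n\pi}{L} x\right) dx = 0.$$
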
This derivation of the Fourier coefficients sweeps several important issues 
under the rug, so to speak. 

We haven’t said anything about the properties of the function f, so 
we cannot be sure that the Fourier series converges, for a given x, and 
even if it does, we cannot be sure that the sum of the series is f(x). We 
also have not said anything about the integrability of the function f, and 
have not specified the type of integral being used in Equations (2.2) and 
(2.3). Finally, we have not justified integrating an infinite series term-by- 
term. These are not issues that are easily dealt with and it is reasonable, 
given our aims in this book, to leave those issues under the rug for now 
and to rely on the formulas above without further comment. In signal 
processing our primary concern is computing with measured data, in the 
form of finite-length vectors and matrices. Functions of continuous variables 
and infinite sequences guide our thinking, but enter into our calculations 
only as members of finite-parameter families. 
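
In the spirit of computing with finite data, here is a small Python sketch that approximates the coefficients in Equations (2.2) and (2.3) by the midpoint rule. The test function f(x) = x, the value L = 2, the number of sample points, and the function name are illustrative assumptions; for this f an integration by parts gives $a_n = 0$ and $b_n = 2L(-1)^{n+1}/(n\pi)$, which the code prints alongside the numerical values.

```python
import math

def fourier_coefficients(f, L, n, samples=20000):
    """Approximate a_n and b_n of Equations (2.2) and (2.3) on [-L, L]
    using the midpoint rule with the given number of subintervals."""
    h = 2.0 * L / samples
    a_n = 0.0
    b_n = 0.0
    for k in range(samples):
        x = -L + (k + 0.5) * h  # midpoint of the k-th subinterval
        a_n += f(x) * math.cos(n * math.pi * x / L) * h
        b_n += f(x) * math.sin(n * math.pi * x / L) * h
    return a_n / L, b_n / L

L = 2.0
for n in range(1, 6):
    a_n, b_n = fourier_coefficients(lambda x: x, L, n)
    closed_form = 2.0 * L * (-1) ** (n + 1) / (n * math.pi)
    print(n, a_n, b_n, closed_form)
```

The coefficients $b_n$ decay only like 1/n, which reflects the jump that the 2L-periodic extension of f(x) = x has at x = ±L.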

There are many texts, such as [80], that the reader may consult that address 
the more mathematical aspects of Fourier analysis. The book [101] by 
Körner is a highly entertaining journey through many aspects of pure and 
applied Fourier analysis, while the small book [51] by Champeney summarizes, 
without proofs, most of the relevant theorems pertaining to Fourier 
series and Fourier transforms. The discussion in Chapter 26 is taken largely 
from [51]. For a sampling of more advanced material on signal processing 
and its applications, the reader may consult [3, 87]. 

At this early stage, it is useful to address the issue of periodicity. Clearly, 
the Fourier series itself can be viewed as a function of x with period 2L. 
Consequently, many books on the subject assume, from the start, that 
the function f is also 2L-periodic. We can, of course, extend the original 
function f to the whole real line as a 2L-periodic function. If f is continuous 
on [-L, L], but $f(-L) \neq f(L)$, we can preserve continuity of the periodic 
extension by first reflecting the function about the point x = L, creating a 
function on the interval [-L, 3L] that has the same values at -L and 3L, 
and then extending that function as a 4L-periodic function. However, our 
concern here is largely with problems that arise in remote sensing, such as 
radar, sonar, tomography, and the like, in which the function f of interest 
is nonzero only on some finite interval. As we shall see, assuming a periodic 
extension at the start may not be a good idea. 


2.3 Complex Exponential Functions 


The most important functions in signal processing are the complex ex- 
ponential functions. Using trigonometric identities it is easy to show that 
the function $h : \mathbb{R} \to \mathbb{C}$ defined by 


$$h(x) = \cos x + i \sin x,$$


has the property h(x + y) = h(x)h(y). Therefore, we write it in exponential 
form as $h(x) = c^x$, for some (necessarily complex) scalar c. With x = 1 we 
have 

$$h(1) = \cos 1 + i \sin 1 = c.$$

Applying the Taylor series expansion 

$$e^t = 1 + t + \frac{1}{2!} t^2 + \frac{1}{3!} t^3 + \cdots$$

for t = i we have 

$$e^i = \cos 1 + i \sin 1.$$


Consequently, we have $c = e^i$ and 

$$e^{ix} = \cos x + i \sin x.$$


Because it is simpler to work with exponential functions than with trigonometric 
functions, we use the identities 


$$\cos x = \frac{1}{2}\left(e^{ix} + e^{-ix}\right),$$

and 

$$\sin x = \frac{1}{2i}\left(e^{ix} - e^{-ix}\right)$$


to reformulate Fourier series and Fourier transforms in terms of complex 
exponential functions. In place of Equation (2.1) we have 


$$f(x) \approx \sum_{n=-\infty}^{\infty} c_n e^{i\frac{n\pi}{L} x},$$


with 

$$c_n = \frac{1}{2L} \int_{-L}^{L} f(x) e^{-i\frac{n\pi}{L} x}\, dx. \qquad (2.5)$$


If f is a continuous function, with f(—L) = f(L) (so that it has a contin- 
uous 2L-periodic extension), then f is uniquely determined by its Fourier 
coefficients [101, Theorem 2.4], even though the Fourier series may not 
converge to f(x) for some x. 
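
The complex coefficients are tied to the trigonometric ones in a simple way. The short calculation below is not spelled out in the text; it follows from Equations (2.2), (2.3), and (2.5) together with the identity $e^{-i\theta} = \cos\theta - i\sin\theta$, and it holds for n ≥ 1:

```latex
c_n = \frac{1}{2L}\int_{-L}^{L} f(x)\left[\cos\left(\tfrac{n\pi}{L}x\right)
      - i\sin\left(\tfrac{n\pi}{L}x\right)\right] dx
    = \tfrac{1}{2}\,(a_n - i\,b_n),
\qquad
c_{-n} = \tfrac{1}{2}\,(a_n + i\,b_n),
\qquad
c_0 = \tfrac{1}{2}\,a_0 .
```

Adding the terms with indices n and -n then gives $c_n e^{i\frac{n\pi}{L}x} + c_{-n} e^{-i\frac{n\pi}{L}x} = a_n \cos\left(\frac{n\pi}{L}x\right) + b_n \sin\left(\frac{n\pi}{L}x\right)$, so the complex series reproduces the series in Equation (2.1) term by term.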


2.4 Fourier Transforms 


Suppose now that f is a complex-valued function defined on the whole 
real line. The Fourier transform of f is the function $F : \mathbb{R} \to \mathbb{C}$ given by 


$$F(\gamma) = \int_{-\infty}^{\infty} f(x) e^{ix\gamma}\, dx. \qquad (2.6)$$

Given F, the Fourier Inversion Formula tells us how to get back to f(x): 

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\gamma) e^{-ix\gamma}\, d\gamma. \qquad (2.7)$$


The function f(x) is sometimes called the inverse Fourier transform (IFT) 
of $F(\gamma)$. Note that the formulas in Equations (2.6) and (2.7) are nearly 
identical. Because of this, the terminology in other texts may differ from 
ours. As was the case with Fourier series, we have again swept several issues 
under the rug for now. We have not specified the properties of the function 
f that would guarantee the existence of the integrals in Equation (2.6); 
indeed, we have not said which definition of integration we must use. Even 
when we require that f be sufficiently well behaved, the Fourier transform 
function F may not be, and so the inversion formula in Equation (2.7) 
may require some interpretation. The functions f(x) and $F(\gamma)$ are called 
a Fourier-transform pair. The definitions of the FT and IFT just given 
may differ slightly from the ones found elsewhere; our definitions are those 
of Bochner and Chandrasekharan [13] and Twomey [156]. The differences 
are minor and involve only the placement of the quantity $2\pi$ and of the 
minus sign in the exponent. One sometimes sees the Fourier transform of 
the function f denoted $\hat{f}$, but we shall not use that notation here. 


2.5 Basic Properties of the Fourier Transform 


In this section we present the basic properties of the Fourier transform. 
Proofs of these assertions are left as exercises. 

Let u(x) be the Heaviside function; that is, u(x) = 1 if x > 0, and 
u(x) = 0 otherwise. Let $\chi_A(x)$ be the characteristic function of the interval 
[-A, A]; that is, $\chi_A(x) = 1$ for x in [-A, A] and $\chi_A(x) = 0$ otherwise. Let 
sgn(x) be the sign function; that is, sgn(x) = 1 if x > 0, and sgn(x) = -1 
if x < 0. The following are basic properties of the Fourier transform. 


• Symmetry: The FT of the function F(x) is $2\pi f(-\gamma)$. For example, 
the FT of the function $f(x) = \frac{\sin(\Omega x)}{\pi x}$ is $\chi_\Omega(\gamma)$, so the FT of $g(x) = \chi_\Omega(x)$ 
is $G(\gamma) = 2\pi \frac{\sin(\Omega\gamma)}{\pi\gamma}$. 


• Conjugation: The FT of $\overline{f(-x)}$ is $\overline{F(\gamma)}$. 


• Scaling: The FT of f(ax) is $\frac{1}{|a|} F\left(\frac{\gamma}{a}\right)$ for any nonzero constant a. 


• Shifting: The FT of f(x - a) is $e^{ia\gamma} F(\gamma)$. 


• Modulation: The FT of $f(x)\cos(\gamma_0 x)$ is $\frac{1}{2}\left[F(\gamma + \gamma_0) + F(\gamma - \gamma_0)\right]$. 


• Differentiation: The FT of the nth derivative, $f^{(n)}(x)$, is 
$(-i\gamma)^n F(\gamma)$. The IFT of $F^{(n)}(\gamma)$ is $(ix)^n f(x)$. 


• Convolution in x: Let f, F, g, G and h, H be FT pairs, with 

$$h(x) = \int f(y) g(x - y)\, dy, \qquad (2.8)$$


so that h(x) = (f * g)(x) is the convolution of f(x) and g(x). Then 
$H(\gamma) = F(\gamma)G(\gamma)$. For example, if we take $g(x) = \overline{f(-x)}$, then 


$$h(x) = \int f(x + y)\overline{f(y)}\, dy = \int f(y)\overline{f(y - x)}\, dy = r_f(x)$$

is the autocorrelation function associated with f(x) and 


$$H(\gamma) = |F(\gamma)|^2 = R_f(\gamma) \geq 0$$

is the power spectrum of f(x). 


• Convolution in $\gamma$: Let f, F, g, G and h, H be FT pairs, with h(x) = 
f(x)g(x). Then $H(\gamma) = \frac{1}{2\pi}(F * G)(\gamma)$. 


Ex. 2.1 Let $F(\gamma)$ be the FT of the function f(x). Use the definitions of the 
FT and IFT given in Equations (2.6) and (2.7) to establish the 
basic properties of the Fourier transform operation listed above. To establish 
the convolution formula calculate $H(\gamma)$ using Equation (2.8) and switch the 
order of integration. 


2.6 Some Fourier-Transform Pairs 


The exercises in this section introduce the reader to several Fourier- 
transform pairs. 


Ex. 2.2 Show that the FT of the function $f(x) = u(x)e^{-ax}$ is $F(\gamma) = \frac{1}{a - i\gamma}$, 
for every positive constant a, where u(x) is the Heaviside function. 


Ex. 2.3 Show that the FT of $f(x) = \chi_A(x)$ is $F(\gamma) = 2\frac{\sin(A\gamma)}{\gamma}$. Similarly, 
show that the IFT of the function $F(\gamma) = \chi_\Gamma(\gamma)$ is $f(x) = \frac{\sin(\Gamma x)}{\pi x}$. 


Ex. 2.4 Show that the IFT of the function $F(\gamma) = 2i/\gamma$ is f(x) = sgn(x). 
Hint: Write the formula for the inverse Fourier transform of $F(\gamma)$ as 


$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{2i}{\gamma} \cos(\gamma x)\, d\gamma - \frac{i}{2\pi} \int_{-\infty}^{\infty} \frac{2i}{\gamma} \sin(\gamma x)\, d\gamma,$$


which reduces to 


$$f(x) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1}{\gamma} \sin(\gamma x)\, d\gamma,$$

since the integrand of the first integral is odd. For x > 0 consider the 
Fourier transform of the function $\chi_x(t)$. For x < 0 perform the change of 
variables u = -x. 


Generally, the functions f(x) and $F(\gamma)$ are complex-valued, so that we 
may speak about their real and imaginary parts. The next exercise explores 
the connections that hold among these real-valued functions. 

Ex. 2.5 Let f(x) be arbitrary and $F(\gamma)$ its Fourier transform. Let $F(\gamma) = 
R(\gamma) + iX(\gamma)$, where R and X are real-valued functions, and similarly, let 
$f(x) = f_1(x) + i f_2(x)$, where $f_1$ and $f_2$ are real-valued. Find relationships 
between the pairs R, X and $f_1$, $f_2$. 


Definition 2.1 We define the even part of f(x) to be the function 

$$f_e(x) = \frac{f(x) + f(-x)}{2},$$

and the odd part of f(x) to be 

$$f_o(x) = \frac{f(x) - f(-x)}{2};$$

define $F_e$ and $F_o$ similarly for F the FT of f. 


Ex. 2.6 Show that $F(\gamma)$ is real-valued and even if and only if f(x) is real-valued 
and even. 


Definition 2.2 We say that f is a causal function if f(x) = 0 for all 
x < 0. 


Definition 2.3 The function X is the Hilbert transform of function R if 

$$X(\gamma) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{R(\alpha)}{\gamma - \alpha}\, d\alpha.$$

Ex. 2.7 Let $F(\gamma) = R(\gamma) + iX(\gamma)$ be the decomposition of F into its real 
and imaginary parts. Show that, if f is causal, then R and X are related; 
specifically, show that X is the Hilbert transform of R. Hint: If f(x) = 0 
for x < 0 then f(x)sgn(x) = f(x). Apply the convolution theorem, then 
compare real and imaginary parts. 


Definition 2.4 When the Fourier transform function $F(\gamma)$ is nonzero only 
within a bounded interval $[-\Gamma, \Gamma]$, we say that F is support-limited, and f 
is $\Gamma$-band-limited. 


Ex. 2.8 Let f(x), $F(\gamma)$ and g(x), $G(\gamma)$ be Fourier transform pairs. Use the 
conjugation property of Fourier transforms and convolution to establish the 
Parseval-Plancherel Equation 


$$\langle f, g \rangle = \int f(x)\overline{g(x)}\, dx = \frac{1}{2\pi} \int F(\gamma)\overline{G(\gamma)}\, d\gamma. \qquad (2.9)$$


An important particular case of the Parseval-Plancherel Equation is 


$$\|f\|^2 = \langle f, f \rangle = \int |f(x)|^2\, dx = \frac{1}{2\pi} \int |F(\gamma)|^2\, d\gamma. \qquad (2.10)$$

Ex. 2.9 The one-sided Laplace transform (LT) of f is $\mathcal{F}$ given by 


$$\mathcal{F}(z) = \int_{0}^{\infty} f(x) e^{-zx}\, dx.$$


Compute $\mathcal{F}(z)$ for f(x) = u(x), the Heaviside function. Compare $\mathcal{F}(-i\gamma)$ 
with the FT of u. 


Ex. 2.10 Show that the Fourier transform of $f(x) = e^{-\alpha^2 x^2}$ is 
$F(\gamma) = \frac{\sqrt{\pi}}{\alpha} e^{-\left(\frac{\gamma}{2\alpha}\right)^2}$. Hints: Calculate the derivative $F'(\gamma)$ by differentiating under 
the integral sign in the definition of F and integrating by parts. Then solve 
the resulting differential equation, obtaining 


$$F(\gamma) = K e^{-\left(\frac{\gamma}{2\alpha}\right)^2},$$


for some constant K to be determined. To determine K, use the Parseval-Plancherel 
Equation (2.10) and the change of variables $t = 2\alpha^2 x$ to write 


$$\int |f(x)|^2\, dx = \int e^{-2\alpha^2 x^2}\, dx = \frac{1}{2\alpha^2} \int e^{-\frac{t^2}{2\alpha^2}}\, dt,$$


from which it follows that $K = \frac{\sqrt{\pi}}{\alpha}$. 


2.7 Dirac Deltas 


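We saw earlier that the function $F(\gamma) = \chi_\Gamma(\gamma)$ has for its inverse Fourier 
transform the function $f(x) = \frac{\sin(\Gamma x)}{\pi x}$; note that $f(0) = \frac{\Gamma}{\pi}$ and f(x) = 0 for the 
first time when $\Gamma x = \pi$ or $x = \frac{\pi}{\Gamma}$. For any $\Gamma$-band-limited function g(x) we 
have $G(\gamma) = G(\gamma)\chi_\Gamma(\gamma)$, so that, for any $x_0$, we have 


$$g(x_0) = \int_{-\infty}^{\infty} g(x) \frac{\sin(\Gamma(x - x_0))}{\pi(x - x_0)}\, dx.$$


We describe this by saying that the function $f(x) = \frac{\sin(\Gamma x)}{\pi x}$ has the sifting 
property for all $\Gamma$-band-limited functions g(x). 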
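
The sifting property can be checked numerically. The Python sketch below uses the test function $g(x) = (\sin x / x)^2$, whose Fourier transform is supported on [-2, 2], so it is $\Gamma$-band-limited for any $\Gamma \geq 2$. The choices $\Gamma = 3$, $x_0 = 0.7$, the truncation of the integral to [-R, R], the step size, and the function names are all assumptions made for illustration.

```python
import math

def sinc_kernel(x, Gamma):
    """sin(Gamma x) / (pi x), with its limiting value Gamma / pi at x = 0."""
    if x == 0.0:
        return Gamma / math.pi
    return math.sin(Gamma * x) / (math.pi * x)

def g(x):
    """(sin x / x)^2; its Fourier transform vanishes outside [-2, 2]."""
    if x == 0.0:
        return 1.0
    return (math.sin(x) / x) ** 2

def sift(func, x0, Gamma, R=200.0, h=0.01):
    """Midpoint-rule approximation of the integral over [-R, R] of
    func(x) * sin(Gamma (x - x0)) / (pi (x - x0)).
    Truncating the real line to [-R, R] is itself an approximation."""
    n = int(round(2.0 * R / h))
    midpoints = (-R + (k + 0.5) * h for k in range(n))
    return sum(func(x) * sinc_kernel(x - x0, Gamma) * h for x in midpoints)

x0 = 0.7
approx = sift(g, x0, Gamma=3.0)
print(approx, g(x0))
```

Rerunning the sketch with $\Gamma$ smaller than 2 shows the property failing, since g is then no longer $\Gamma$-band-limited.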

As $\Gamma$ grows larger, f(0) approaches $+\infty$, while f(x) goes to zero for 
x ≠ 0. The limit is therefore not a function; it is a generalized function 
called the Dirac delta function at zero, denoted $\delta(x)$. For this reason the 
function $f(x) = \frac{\sin(\Gamma x)}{\pi x}$ is called an approximate delta function. The FT of 
$\delta(x)$ is the function $F(\gamma) = 1$ for all $\gamma$. The Dirac delta function $\delta(x)$ enjoys 
the sifting property for all appropriate g(x); that is, 


$$g(x_0) = \int_{-\infty}^{\infty} g(x)\delta(x - x_0)\, dx.$$


Describing which functions g(x) are appropriate is part of the theory of 
generalized functions and is beyond the scope of this text. It follows from 
the sifting and shifting properties that the FT of $\delta(x - x_0)$ is the function 
$e^{ix_0\gamma}$. 


The formula for the inverse FT now says 


$$\delta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ix\gamma}\, d\gamma. \qquad (2.11)$$

If we try to make sense of this integral according to the rules of calculus we 
get stuck quickly. The problem is that the integral formula doesn’t mean 
quite what it does ordinarily and the $\delta(x)$ is not really a function, but 
an operator on functions; it is sometimes called a distribution. The Dirac 
deltas are mathematical fictions, not in the bad sense of being lies or fakes, 
but in the sense of being made up for some purpose. They provide helpful 
descriptions of impulsive forces, probability densities in which a discrete 
point has nonzero probability, or, in array processing, objects far enough 
away to be viewed as occupying a discrete point in space. 

We shall treat the relationship expressed by Equation (2.11) as a formal 
statement, rather than attempt to explain the use of the integral in what 
is surely an unconventional manner. 

If we move the discussion into the $\gamma$ domain and define the Dirac delta 
function $\delta(\gamma)$ to be the FT of the function that has the value $\frac{1}{2\pi}$ for all 
x, then the FT of the complex exponential function $\frac{1}{2\pi} e^{-i\gamma_0 x}$ is $\delta(\gamma - \gamma_0)$, 
visualized as a “spike” at $\gamma_0$, that is, a generalized function that has the 
value $+\infty$ at $\gamma = \gamma_0$ and zero elsewhere. This is a useful result, in that it 
provides the motivation for considering the Fourier transform of a signal 
s(t) containing hidden periodicities. If s(t) is a sum of complex exponentials 
with frequencies $-\gamma_n$, then its Fourier transform will consist of Dirac delta 
functions $\delta(\gamma - \gamma_n)$. If we then estimate the Fourier transform of s(t) from 
sampled data, we are looking for the peaks in the Fourier transform that 
approximate the infinitely high spikes of these delta functions. 


Ex. 2.11 Use the fact that sgn(x) = 2u(x) - 1 and Exercise 2.4 to show 
that f(x) = u(x) has the FT $F(\gamma) = i/\gamma + \pi\delta(\gamma)$. 


Ex. 2.12 Let f, F be a FT pair. Let $g(x) = \int_{-\infty}^{x} f(y)\, dy$. Show that the 
FT of g(x) is $G(\gamma) = \pi F(0)\delta(\gamma) + \frac{iF(\gamma)}{\gamma}$. Hint: For the Heaviside function 
u(x) we have 


$$\int_{-\infty}^{x} f(y)\, dy = \int_{-\infty}^{\infty} f(y) u(x - y)\, dy.$$


2.8 Convolution Filters 


In many remote-sensing problems we want values of a function f(x), 
but are only able to measure values of another function, h(x), related to 
f(x) in some way. For example, suppose that x is time and f(x) represents 
what a speaker says into a telephone. The phone line distorts the signal 
somewhat, often attenuating the higher frequencies. What the person at 
the other end hears is not f(x), but a related signal function, h(x). For 
another example, suppose that f(x,y) is a two-dimensional picture viewed 
by someone with poor eyesight. What that person sees is not f(x,y) but 
h(x,y), a distorted version of the true f(x,y). In both examples, our goal 
is to recover the original undistorted signal or image. To do this, it helps 
to model the distortion. Convolution is a useful tool for this purpose. 

Often, the function h(x) has Fourier transform 

$$H(\gamma) = F(\gamma)G(\gamma),$$

so that h(x) is the convolution of the desired function f(x) with another 
function g(x). The function $G(\gamma)$ describes the effects of the measuring system, 
the telephone line in our first example, or the weak eyes in the second 
example, or the refraction of light as it passes through the atmosphere, in 
optical imaging. If we can use our measurements of h(x) to estimate $H(\gamma)$ 
and if we have some knowledge of the system distortion function, that is, 
some knowledge of $G(\gamma)$ itself, then there is a chance that we can estimate 
$F(\gamma)$, and thereby estimate f(x). 

If we apply the Fourier Inversion Formula to $H(\gamma) = F(\gamma)G(\gamma)$, we get 


$$h(x) = \frac{1}{2\pi} \int F(\gamma)G(\gamma) e^{-ix\gamma}\, d\gamma. \qquad (2.12)$$

The function h(x) that results is h(x) = (f * g)(x), the convolution of the 
functions f(x) and g(x), with the latter given by 


$$g(x) = \frac{1}{2\pi} \int G(\gamma) e^{-ix\gamma}\, d\gamma.$$


Note that, if $f(x) = \delta(x)$, then h(x) = g(x). In the image processing 
example, this says that, if the true picture f is a single bright spot, then 
the blurred image h is g itself. For that reason, the function g is called the 
point-spread function of the distorting system. 

Convolution filtering refers to the process of converting any given function, 
say f(x), into a different function, say h(x), by convolving f(x) with a 
fixed function g(x). Since this process can be achieved by multiplying $F(\gamma)$ 
by $G(\gamma)$ and then inverse Fourier transforming, such convolution filters are 
studied in terms of the properties of the function $G(\gamma)$, known in this context 
as the system transfer function, or the optical transfer function (OTF); 
when $\gamma$ is a frequency, rather than a spatial frequency, $G(\gamma)$ is called the 
frequency-response function of the filter. The magnitude function $|G(\gamma)|$ is 
called the modulation transfer function (MTF). The study of convolution 
filters is a major part of signal processing. Such filters provide both reasonable 
models for the degradation that signals undergo, and useful tools 
for reconstruction. For an important example of the use of filtering, see 
Chapter 27 on Reverberation and Echo-Cancellation. 

Let us rewrite Equation (2.12), replacing $F(\gamma)$ with its definition, as 
given by Equation (2.6). Then we have 


$$h(x) = \frac{1}{2\pi} \int \left( \int f(t) e^{it\gamma}\, dt \right) G(\gamma) e^{-ix\gamma}\, d\gamma. \qquad (2.13)$$


Interchanging the order of integration, we get 


$$h(x) = \int f(t) \left( \frac{1}{2\pi} \int G(\gamma) e^{-i(x - t)\gamma}\, d\gamma \right) dt. \qquad (2.14)$$


The inner integral is g(x - t), so we have 


$$h(x) = \int f(t) g(x - t)\, dt; \qquad (2.15)$$


this is the definition of the convolution of the functions f and g. 

If we know the nature of the blurring, then we know $G(\gamma)$, at least 
approximately. We can try to remove the blurring by taking measurements 
of h(x), estimating $H(\gamma) = F(\gamma)G(\gamma)$, dividing these numbers by the value 
of $G(\gamma)$, and then inverse Fourier transforming. The problem is that our 
measurements are always noisy, and typical functions $G(\gamma)$ have many zeros 
and small values, making division by $G(\gamma)$ dangerous, except for those $\gamma$ 
where the values of $G(\gamma)$ are not too small. These latter values of $\gamma$ tend to 
be the smaller ones, centered around zero, so that we end up with estimates 
of $F(\gamma)$ itself only for the smaller values of $\gamma$. The result is a low-pass 
filtering of the object f(x). 
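
The danger of dividing by small values of $G(\gamma)$ is easy to see in a discrete experiment. The Python sketch below uses a finite discrete Fourier transform as a stand-in for Equation (2.6), with circular convolution in place of Equation (2.15). The box-shaped object, the Gaussian point-spread function, the noise level, the cutoff value, and all function names are assumptions chosen for illustration only.

```python
import cmath
import math
import random

def dft(v, sign=1):
    """Discrete analogue of Equation (2.6): V[k] = sum_t v[t] exp(sign*2*pi*i*k*t/n)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(V):
    """Inverse of dft(v, sign=1), the discrete analogue of Equation (2.7)."""
    n = len(V)
    return [z / n for z in dft(V, sign=-1)]

def rms_error(u, v):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(u, v)) / len(u))

n = 64
f = [1.0 if 24 <= t < 40 else 0.0 for t in range(n)]   # true object: a box
sigma = 2.0                                            # blur width in samples
g = [math.exp(-min(t, n - t) ** 2 / (2.0 * sigma ** 2)) for t in range(n)]
total = sum(g)
g = [v / total for v in g]                             # point-spread function

F, G = dft(f), dft(g)
rng = random.Random(1)
blurred = idft([Fk * Gk for Fk, Gk in zip(F, G)])      # circular convolution f * g
h = [z.real + rng.gauss(0.0, 1e-3) for z in blurred]   # noisy measurements of h
H = dft(h)

# Naive inverse filter: divide by G at every frequency.
naive = idft([Hk / Gk for Hk, Gk in zip(H, G)])

# Divide only where |G| exceeds a (hypothetical) cutoff; zero the rest.
cutoff = 0.05
filtered = idft([Hk / Gk if abs(Gk) > cutoff else 0.0 for Hk, Gk in zip(H, G)])

err_naive = rms_error(naive, f)
err_filtered = rms_error(filtered, f)
print(err_naive, err_filtered)
```

In this setup the naive estimate is dominated by amplified noise at the frequencies where G is tiny, while the thresholded estimate is a low-pass filtered version of the box: usable, but with its sharp edges smoothed.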

To investigate such low-pass filtering, we suppose that G(γ) = 1, for
|γ| ≤ Γ, and G(γ) = 0, otherwise. Then the filter is called the ideal
Γ-low-pass filter. In the far-field propagation model, the variable x is
spatial, and the variable γ is spatial frequency, related to how the
function f(x) changes spatially, as we move x. Rapid changes in f(x) are
associated with values of F(γ) for large γ. For the case in which the
variable x is time, the variable γ becomes frequency, and the effect of
the low-pass filter on f(x) is to remove its higher-frequency components.

One effect of low-pass filtering in image processing is to smooth out
the more rapidly changing features of an image. This can be useful if these
features are simply unwanted oscillations, but if they are important detail,
such as edges, the smoothing presents a problem. Restoring such wanted
detail is often viewed as removing the unwanted effects of the low-pass
filtering; in other words, we try to recapture the missing
high-spatial-frequency values that have been zeroed out. Such an approach
to image restoration is called frequency-domain extrapolation. How can we
hope to recover these missing spatial frequencies, when they could have
been anything? To have some chance of estimating these missing values we
need to have some prior information about the image being reconstructed.


2.9 A Discontinuous Function


Consider the function f(x) = 1/(2A), for |x| ≤ A, and f(x) = 0, otherwise.
The Fourier transform of this f(x) is

$$F(\gamma) = \frac{\sin(A\gamma)}{A\gamma},$$

for all real γ ≠ 0, and F(0) = 1. Note that F(γ) is nonzero throughout the
real line, except for isolated zeros, but that it goes to zero as we go to the
infinities. This is typical behavior. Notice also that the smaller the A, the
slower F(γ) dies out; the first zeros of F(γ) are at |γ| = π/A, so the main
lobe widens as A goes to zero. The function f(x) is not continuous, so its
Fourier transform cannot be absolutely integrable. In this case, the Fourier
Inversion Formula must be interpreted as involving convergence in the L²
norm.
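The closed form for this transform can be checked against a direct numerical evaluation of the defining integral, using the convention F(γ) = ∫ f(x) e^{iγx} dx. The following is a small sketch with a midpoint rule; the function names, the step count, and the test values of A and γ are arbitrary choices.

```python
import cmath
import math

def box_transform_numeric(gamma, A, steps=4000):
    """Midpoint-rule approximation of the integral of (1/(2A)) e^{i*gamma*x} over [-A, A]."""
    dx = 2 * A / steps
    total = 0j
    for k in range(steps):
        x = -A + (k + 0.5) * dx
        total += cmath.exp(1j * gamma * x)
    return total * dx / (2 * A)

def box_transform_exact(gamma, A):
    """sin(A*gamma)/(A*gamma), with the value 1 at gamma = 0."""
    return 1.0 if gamma == 0 else math.sin(A * gamma) / (A * gamma)

# Compare the two for a narrow and a wide box.
worst = 0.0
for A in (0.5, 2.0):
    for gamma in (0.0, 1.0, 3.7, -6.0):
        gap = abs(box_transform_numeric(gamma, A) - box_transform_exact(gamma, A))
        worst = max(worst, gap)

# The first zeros sit at |gamma| = pi/A, so a smaller A pushes them outward.
first_zero_narrow = math.pi / 0.5
first_zero_wide = math.pi / 2.0
```

The narrow box (A = 0.5) has its first zero at 2π, the wide box (A = 2) at π/2, in line with the remark that the main lobe widens as A shrinks.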


2.10 Shannon’s Sampling Theorem 


As one might expect, there are connections between Fourier series and 
Fourier transforms, and several different ways to establish these connec- 
tions. I believe the simplest way is to use Shannon’s Sampling Theorem. 

When the Fourier transform function F(γ) is nonzero only within a
bounded interval [-Γ, Γ], we say that F is support-limited, and f is then
said to be Γ-band-limited. Then F has a Fourier series and the Fourier
coefficients are

$$c_n = \frac{1}{2\Gamma}\int_{-\Gamma}^{\Gamma} F(\gamma)\, e^{-i n \frac{\pi}{\Gamma}\gamma}\,d\gamma. \tag{2.16}$$

Comparing Equations (2.7) and (2.16), we see that

$$c_n = \frac{\pi}{\Gamma}\, f\Big(\frac{n\pi}{\Gamma}\Big).$$

This tells us that whenever F is determined by its Fourier coefficients, both
f and F are determined by the values of the inverse Fourier transform
function f at the infinite set of points x = nπ/Γ.

The Fourier coefficients c_n and the inverse Fourier transform function f
play similar roles. When F is support-limited, we attempt to represent F as
an infinite sum of the complex exponential functions e^{inπγ/Γ}, and the
c_n are the complex weights associated with each of these exponential
functions. More generally, when F may not be support-limited, we attempt to
express F(γ) as a sum (an integral) over x of all the complex exponential
functions e^{ixγ}, and the complex numbers f(x) are the weights associated
with each exponential function.

In many signal-processing applications the variable x is time and is
denoted t, while the variable γ is frequency, and is denoted ω. Then
Shannon’s Sampling Theorem says that, whenever there is a bound on the
absolute value of the frequencies involved in the function f, we can
reconstruct f completely from values (or samples) of f at an infinite
discrete set of values of x whose spacing depends on the bound on the
frequencies; the higher the bound, the smaller the spacing between samples.
When our sample spacing is too large, we get aliasing. Aliasing produces
the familiar “strobe-light” effect and is why the wagon wheels in cowboy
movies appear to revolve backwards.

If F(γ) is supported on the interval [-Γ, Γ], then F and f are completely
determined by the values of f(x) at the infinite set of points x = nπ/Γ. The
spacing Δ = π/Γ is called the Nyquist spacing.
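Aliasing can be made concrete with two sinusoids that are indistinguishable at a given sample spacing. In this sketch the band limit Γ = 2 and the two frequencies are assumed values picked for illustration; the only fact used is that cos(γx) sampled at x = nΔ is unchanged when γ is shifted by 2π/Δ.

```python
import math

def cosine_samples(gamma, spacing, count):
    """Samples of cos(gamma * x) at x = n * spacing, n = 0, ..., count - 1."""
    return [math.cos(gamma * n * spacing) for n in range(count)]

Gamma = 2.0                        # assumed band limit
delta = math.pi / Gamma            # Nyquist spacing for that band limit

low = 1.5                          # a frequency inside [-Gamma, Gamma]
high = low + 2 * math.pi / delta   # equals low + 2*Gamma, well outside the band

low_samples = cosine_samples(low, delta, 50)
high_samples = cosine_samples(high, delta, 50)

# The two sample sequences agree up to rounding error.
largest_gap = max(abs(a - b) for a, b in zip(low_samples, high_samples))
```

Because the samples coincide, data taken at spacing Δ cannot tell the frequency 5.5 from the frequency 1.5; only the assumption that f is Γ-band-limited rules out the higher one.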

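Ex. 2.13 Let Γ = π, so that Δ = 1, f_m = f(m), and g_m = g(m). Use
the orthogonality of the functions e^{imγ} on [-π, π] to establish Parseval’s
Equation:

$$\langle f, g \rangle = \sum_{m=-\infty}^{\infty} f_m \overline{g_m} = \frac{1}{2\pi}\int_{-\pi}^{\pi} F(\gamma)\overline{G(\gamma)}\,d\gamma, \tag{2.17}$$

from which it follows that

$$\langle f, f \rangle = \frac{1}{2\pi}\int_{-\infty}^{\infty} |F(\gamma)|^2\,d\gamma.$$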


Ex. 2.14 Let f(x) be defined for all real x and let F(γ) be its FT. Let

$$g(x) = \sum_{k=-\infty}^{\infty} f(x + 2\pi k),$$

assuming the sum exists. Show that g is a 2π-periodic function. Compute
its Fourier series and use it to derive the Poisson summation formula:

$$\sum_{k=-\infty}^{\infty} f(2\pi k) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} F(n).$$
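A numerical sanity check of the Poisson summation formula is possible with a Gaussian, for which both f and F are known in closed form. The sketch assumes the pair f(x) = e^{-α²x²}, F(γ) = (√π/α) e^{-(γ/2α)²} under the convention F(γ) = ∫ f(x) e^{iγx} dx; the value α = 0.3 and the truncation of the sums at |k|, |n| ≤ 60 are arbitrary choices. It checks the identity; it does not replace the derivation the exercise asks for.

```python
import math

alpha = 0.3   # arbitrary width parameter

def f(x):
    """Gaussian f(x) = exp(-(alpha*x)^2)."""
    return math.exp(-(alpha * x) ** 2)

def F(gamma):
    """Its Fourier transform, (sqrt(pi)/alpha) * exp(-(gamma/(2*alpha))^2)."""
    return math.sqrt(math.pi) / alpha * math.exp(-(gamma / (2 * alpha)) ** 2)

terms = range(-60, 61)   # both sums converge very quickly, so truncation is harmless

left = sum(f(2 * math.pi * k) for k in terms)
right = sum(F(n) for n in terms) / (2 * math.pi)
gap = abs(left - right)
```

Both sides come out near 1.0573, and they agree to rounding error.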


2.11 What Shannon Does Not Say 


It is important to remember that Shannon’s Sampling Theorem tells us
that the doubly infinite sequence of values $\{f(n\Delta)\}_{n=-\infty}^{\infty}$
is sufficient to recover exactly the function F(γ) and, thereby, the
function f(x). Therefore, sampling at the rate of twice the highest
frequency (in Hertz) is sufficient only when we have the complete doubly
infinite sequence of samples. Of course, in practice, we never have an
infinite number of values of anything, so the rule of thumb expressed by
Shannon’s Sampling Theorem is not valid. Since we know that we will end up
with only finitely many samples, each additional data value is additional
information. There is no reason to stick to the sampling rate of twice the
highest frequency.


2.12 Inverse Problems 


In this section we introduce the concept of an inverse problem, using 
Fourier series to solve a heat-conduction problem. Many of the problems we 
study in applied mathematics are direct problems. For example, we imagine 
a ball dropped from a building of known height h and we calculate the time 
it takes for it to hit the ground and the impact velocity. Once we make cer- 
tain simplifying assumptions about gravity and air resistance, we are able 
to solve this problem easily. Using his inverse-square law of universal grav- 
itation, Newton was able to show that planets move in ellipses, with the 
sun at one focal point. Generally, direct problems conform to the usual 
flow of time and seek the effects due to known causes. Problems we call 
inverse problems go the other way, seeking the causes of observed effects; 
we measure the impact velocity to determine the height h of the build- 
ing. Newton solved an inverse problem when he determined that Kepler’s 
empirical laws of planetary motion follow from an inverse-square law of 
universal gravitation. 
In each of the examples of remote sensing just presented in Chapter 1
we have measured some effects and want to know the causes. In x-ray to- 
mography, for example, we observe that the x-rays that passed through the 
body of the patient come out weaker than when they went in. We know that 
they were weakened, or attenuated, because they were partially absorbed 
by the material they had to pass through; we want to know precisely where 
the attenuation took place. This is an inverse problem; we are trying to go 
back in time, to uncover the causes of the observed effects. 

Direct problems have been studied for a long time, while the theory of 
inverse problems is still being developed. Generally speaking, direct prob- 
lems are easier than inverse problems. Direct problems, at least those cor- 
responding to actual physical situations, tend to be well-posed in the sense 
of Hadamard, while inverse problems are often ill-posed. A problem is said 
to be well-posed if there is a unique solution for each input to the problem 
and the solution varies continuously with the input; roughly speaking, small 
changes in the input lead to small changes in the solution. If we vary the 
height of the building slightly, the time until the ball hits the ground and 
its impact velocity will change only slightly. For inverse problems, there 
may be many solutions, or none, and slight changes in the data can cause 
the solutions to differ greatly. In [7] Bertero and Boccacci give a nice il- 
lustration of the difference between direct and inverse problems, using the 
heat equation. 

Suppose that u(x,t) is the temperature distribution for x in the interval
[0, a] and t ≥ 0. The function u(x,t) satisfies the heat equation

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{D}\frac{\partial u}{\partial t},$$

where D > 0 is the thermal conductivity. In addition, we adopt the
boundary conditions u(x,0) = f(x), and u(0,t) = u(a,t) = 0, for all t. By
separating the variables, and using Fourier series, we find that, if

$$f(x) = \sum_{n=1}^{\infty} f_n \sin\Big(\frac{n\pi x}{a}\Big),$$

where

$$f_n = \frac{2}{a}\int_0^a f(x) \sin\Big(\frac{n\pi x}{a}\Big)\,dx,$$

then

$$u(x,t) = \sum_{n=1}^{\infty} f_n\, e^{-D\left(\frac{\pi n}{a}\right)^2 t} \sin\Big(\frac{n\pi x}{a}\Big).$$

The direct problem is to find u(x,t), given f(x). Suppose that we know
f(x) with some finite precision, that is, we know those Fourier coefficients
f_n for which |f_n| ≥ ε > 0. Because of the decaying exponential factor, fewer
Fourier coefficients in the expansion of u(x,t) will be above this threshold,
and we can determine u(x,t) with the same precision or better. The solution
to the heat equation tends to be smoother than the input distribution.

The inverse problem is to determine the initial distribution f(x) from
knowledge of u(x,t) at one or more times t > 0. As we just saw, for any
fixed time t > 0, the Fourier coefficients of u(x,t) will die off faster than
the f_n do, leaving fewer coefficients above the threshold of ε. This means
we can determine fewer and fewer of the f_n as t grows larger. For t beyond
some point, it will be nearly impossible to say anything about f(x).
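The loss of information with increasing t can be quantified by counting how many coefficients of u(x,t) stay above the threshold. The sketch below assumes the coefficients of u(x,t) are f_n e^{-D(πn/a)² t}; the values D = 1, a = π, ε = 10⁻⁴, and the initial coefficients f_n = 1/n² are hypothetical, chosen so that the decay factor is simply e^{-n² t}.

```python
import math

def visible_count(f_coeffs, D, a, t, eps):
    """Number of n with |f_n * exp(-D * (pi*n/a)^2 * t)| >= eps."""
    count = 0
    for n, f_n in enumerate(f_coeffs, start=1):
        decay = math.exp(-D * (math.pi * n / a) ** 2 * t)
        if abs(f_n * decay) >= eps:
            count += 1
    return count

# Illustrative values only.
D, a, eps = 1.0, math.pi, 1e-4
f_coeffs = [1.0 / n ** 2 for n in range(1, 201)]   # hypothetical f_n, n = 1, ..., 200

times = [0.0, 0.01, 0.1, 1.0]
counts = [visible_count(f_coeffs, D, a, t, eps) for t in times]

# Recovering f_n from u at time t = 1 means multiplying by exp(+n^2), so an
# error in the n-th coefficient of the data is amplified by this factor.
amplification = [math.exp(D * (math.pi * n / a) ** 2 * 1.0) for n in (1, 2, 5)]
```

With these numbers only two coefficients remain above the threshold at t = 1, and the amplification factor for n = 5 already exceeds 10¹⁰, which is the instability that makes the inverse problem ill-posed.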

Once again, the proper interpretation of Equation (2.7) will depend on 
the properties of the functions involved. It may happen that one or both of 
these integrals will fail to be defined in the usual way and will be interpreted 
as the principal value of the integral [80]. 


2.13 Two-Dimensional Fourier Transforms 


The Fourier transform is also defined for functions of several real
variables f(x_1, ..., x_N). The multidimensional FT arises in image
processing, scattering, transmission tomography, and many other areas. In
this section we discuss the extension of the definitions of the FT and IFT
to functions of two real variables.


2.13.1 The Basic Formulas 


For the complex-valued function f(x,y) of two real variables, the Fourier
transformation is

$$F(\alpha,\beta) = \int\!\!\int f(x,y)\, e^{i(x\alpha + y\beta)}\,dx\,dy.$$

Just as in the one-dimensional case, the Fourier transformation that
produced F(α,β) can be inverted to recover the original f(x,y). The Fourier
Inversion Formula in this case is

$$f(x,y) = \frac{1}{4\pi^2}\int\!\!\int F(\alpha,\beta)\, e^{-i(\alpha x + \beta y)}\,d\alpha\,d\beta. \tag{2.18}$$

It is important to note that this procedure can be viewed as two
one-dimensional Fourier inversions: First, we invert F(α,β), as a function
of, say, β only, to get the function of α and y

$$g(\alpha,y) = \frac{1}{2\pi}\int F(\alpha,\beta)\, e^{-i\beta y}\,d\beta;$$

second, we invert g(α,y), as a function of α, to get

$$f(x,y) = \frac{1}{2\pi}\int g(\alpha,y)\, e^{-i\alpha x}\,d\alpha.$$

If we write the functions f(x,y) and F(α,β) in polar coordinates, we
obtain alternative ways to implement the two-dimensional Fourier inversion.
We shall consider these other ways in Chapter 11, when we discuss the
tomography problem of reconstructing a function f(x,y) from line-integral
data.


2.13.2 Radial Functions 


Now we consider the two-dimensional Fourier-transform pairs in polar
coordinates. We convert to polar coordinates using (x,y) = r(cos θ, sin θ)
and (α,β) = ρ(cos ω, sin ω). Then

$$F(\rho,\omega) = \int_0^{\infty}\!\!\int_{-\pi}^{\pi} f(r,\theta)\, e^{i r \rho \cos(\theta - \omega)}\, r\,dr\,d\theta. \tag{2.19}$$

Say that a function f(x,y) of two variables is a radial function if
x² + y² = x_1² + y_1² implies f(x,y) = f(x_1,y_1), for all points (x,y) and
(x_1,y_1); that is, f(x,y) = g(√(x² + y²)) for some function g of one
variable.


Ex. 2.15 Show that if f is radial then its FT F is also radial. Find the
FT of the radial function f(x,y) = 1/√(x² + y²). Hints: Insert
f(r,θ) = g(r) in Equation (2.19) to obtain

$$F(\rho,\omega) = \int_0^{\infty}\!\!\int_{-\pi}^{\pi} g(r)\, e^{i r \rho \cos(\theta - \omega)}\, r\,dr\,d\theta$$

or

$$F(\rho,\omega) = \int_0^{\infty} r\, g(r) \Big[ \int_{-\pi}^{\pi} e^{i r \rho \cos(\theta - \omega)}\,d\theta \Big]\,dr.$$

Show that the inner integral is independent of ω, and then use the fact that

$$\int_{-\pi}^{\pi} e^{i r \rho \cos\theta}\,d\theta = 2\pi J_0(r\rho),$$

with J_0 the 0th order Bessel function, to get

$$F(\rho,\omega) = H(\rho) = 2\pi \int_0^{\infty} r\, g(r)\, J_0(r\rho)\,dr.$$

The function H(ρ) is called the Hankel transform of g(r). Summarizing,
we say that if f(x,y) is a radial function obtained using g then its Fourier
transform F(α,β) is also a radial function, obtained using the Hankel
transform of g.
2.13.3 An Example


For example, suppose that f(x,y) = 1 for √(x² + y²) ≤ R, and f(x,y) =
0, otherwise. Then we have

$$F(\alpha,\beta) = \int_{-\pi}^{\pi}\!\!\int_0^R e^{i(\alpha r\cos\theta + \beta r\sin\theta)}\, r\,dr\,d\theta.$$

In polar coordinates, with α = ρ cos φ and β = ρ sin φ, we have

$$F(\rho,\phi) = \int_0^R\!\!\int_{-\pi}^{\pi} e^{i r \rho \cos(\theta - \phi)}\,d\theta\, r\,dr.$$

The inner integral is well known;

$$\int_{-\pi}^{\pi} e^{i r \rho \cos(\theta - \phi)}\,d\theta = 2\pi J_0(r\rho),$$

where J_0 and J_n denote the 0th order and nth order Bessel functions,
respectively. Using the following identity

$$\int_0^z t^n J_{n-1}(t)\,dt = z^n J_n(z),$$

we have

$$F(\rho,\phi) = \frac{2\pi R}{\rho} J_1(\rho R).$$

Notice that, since f(x,y) is a radial function, that is, dependent only on
the distance from (0,0) to (x,y), its Fourier transform is also radial.

The first positive zero of J_1(t) is around t = 4 (more precisely,
t ≈ 3.83), so when we measure F at various locations and find F(ρ,φ) = 0
for a particular (ρ,φ), we can estimate R ≈ 4/ρ. So, even when a distant
spherical object, like a star, is too far away to be imaged well, we can
sometimes estimate its size by finding where the intensity of the received
signal is zero [101].
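The size estimate can be tried numerically. The sketch below evaluates J_1 from Bessel's integral J_n(x) = (1/π) ∫_0^π cos(nτ - x sin τ) dτ with a midpoint rule, locates the first positive zero by bisection, and then recovers a radius from the location of the first zero of F(ρ) = (2πR/ρ) J_1(ρR). The radius R = 2.5, the bracketing interval [3, 4.5], and all function names are assumptions made for the illustration.

```python
import math

def bessel_j(order, x, steps=400):
    """J_order(x) via Bessel's integral, approximated with the midpoint rule."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h
        total += math.cos(order * tau - x * math.sin(tau))
    return total * h / math.pi

def first_zero_of_j1(lo=3.0, hi=4.5, iterations=60):
    """Bisection on an interval over which J_1 changes sign."""
    f_lo = bessel_j(1, lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = bessel_j(1, mid)
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid      # the zero lies to the right of mid
        else:
            hi = mid                   # the zero lies to the left of mid
    return 0.5 * (lo + hi)

def disk_transform(rho, R):
    """F(rho) = (2*pi*R/rho) * J_1(rho*R) for the disk of radius R."""
    return 2 * math.pi * R / rho * bessel_j(1, rho * R)

z1 = first_zero_of_j1()

R_true = 2.5                 # hypothetical radius of the object
rho_zero = z1 / R_true       # smallest rho at which F vanishes
R_rough = 4.0 / rho_zero     # rule of thumb with the zero taken as 4
R_refined = z1 / rho_zero    # same rule with the more accurate zero
```

The rough rule overestimates the radius by about four percent here, since 4/3.83 ≈ 1.04; using the more accurate zero removes that bias.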

In her 1953 Nature paper with R. G. Gosling the British scientist 
Rosalind Franklin presented evidence she had obtained from x-ray scat- 
tering experiments that corroborated the double-helical structure of the 
DNA molecule proposed a short time previously by Crick and Watson. She 
showed mathematically that the scattering pattern from a helical structure 
would be described by the Bessel functions Jn and noted that the observed 
maximal intensities in her photographs corresponded to the zeros of these 
Bessel functions. 

According to Lightman [111], most historians of science who have stud- 
ied the work that led to the discovery of the structure of DNA agree that 
the contribution of Rosalind Franklin is understated in Watson’s account in 
his book [160]. In 1962 Francis Crick and James Watson shared the Nobel
Prize in Physiology or Medicine with Maurice Wilkins of King’s College,
London, who had worked with Franklin on DNA. Had she not died of cancer in
1958, it is plausible that Franklin, not Wilkins, would have shared the
prize.
2.14 The Uncertainty Principle


We saw earlier that the Fourier transform of the function f(x) = e^{-α²x²}
is

$$F(\gamma) = \frac{\sqrt{\pi}}{\alpha}\, e^{-\left(\frac{\gamma}{2\alpha}\right)^2}.$$

This Fourier-transform pair illustrates well the general fact that the more
concentrated f(x) is, the more spread out F(γ) is. In particular, it is
impossible for both f and F to have bounded support. We prove the following
inequality:

$$\frac{\int x^2 |f(x)|^2\,dx}{\int |f(x)|^2\,dx} \cdot \frac{\int \gamma^2 |F(\gamma)|^2\,d\gamma}{\int |F(\gamma)|^2\,d\gamma} \geq \frac{1}{4}. \tag{2.20}$$

This inequality is the mathematical version of Heisenberg’s Uncertainty
Principle.

As we shall show in Chapter 19, the Cauchy-Schwarz Inequality holds
in any vector space with an inner product. In the present situation, the
Cauchy-Schwarz Inequality tells us that

$$\Big| \int f(x)\overline{g(x)}\,dx \Big|^2 \leq \int |f(x)|^2\,dx \int |g(x)|^2\,dx,$$

with equality if and only if g(x) = kf(x), for some scalar k. We will need
this in the proof of the inequality (2.20). We’ll also need the
Parseval-Plancherel Equation (2.9), as well as the fact that, for any two
complex numbers z and w, we have

$$|zw| \geq \frac{1}{2}\big(z\overline{w} + \overline{z}w\big).$$

In addition, we assume that

$$\lim_{a \to +\infty} a\big(|f(a)|^2 + |f(-a)|^2\big) = 0,$$

so that, using integration by parts, we have

$$\int x \frac{d}{dx}|f(x)|^2\,dx = -\int |f(x)|^2\,dx.$$

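The proof of Inequality (2.20) now follows:

$$\frac{1}{2\pi}\int x^2|f(x)|^2\,dx \int \gamma^2|F(\gamma)|^2\,d\gamma = \frac{1}{2\pi}\int |x f(x)|^2\,dx \int |\gamma F(\gamma)|^2\,d\gamma$$

$$= \int |x f(x)|^2\,dx \int |f'(x)|^2\,dx \geq \Big( \int |x\, \overline{f'(x)}\, f(x)|\,dx \Big)^2$$

$$\geq \Big( \int \frac{x}{2}\big[ f'(x)\overline{f(x)} + f(x)\overline{f'(x)} \big]\,dx \Big)^2 = \frac{1}{4}\Big( \int x \frac{d}{dx}|f(x)|^2\,dx \Big)^2$$

$$= \frac{1}{4}\Big( \int |f(x)|^2\,dx \Big)^2 = \frac{1}{8\pi}\int |f(x)|^2\,dx \int |F(\gamma)|^2\,d\gamma.$$

This completes the proof of Inequality (2.20).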
The significance of this inequality is made evident when we reformulate
it in terms of the variances of probability densities. Suppose that

$$\int |f(x)|^2\,dx = \frac{1}{2\pi}\int |F(\gamma)|^2\,d\gamma = 1,$$

so that we may view |f(x)|² and (1/2π)|F(γ)|² as probability density
functions associated with random variables X and Y, respectively. From
probability theory we know that the expected values E(X) and E(Y) are given
by

$$m = E(X) = \int x |f(x)|^2\,dx$$

and

$$M = E(Y) = \frac{1}{2\pi}\int \gamma |F(\gamma)|^2\,d\gamma.$$

Let

$$g(x) = f(x + m)\, e^{iMx},$$

so that the Fourier transform of g(x) is

$$G(\gamma) = F(\gamma + M)\, e^{-i(\gamma + M)m}.$$

Then |g(x)|² = |f(x + m)|² and |G(γ)|² = |F(γ + M)|²; we also have

$$\int x |g(x)|^2\,dx = 0$$

and

$$\int \gamma |G(\gamma)|^2\,d\gamma = 0.$$

The point here is that we can assume that m = 0 and M = 0. Consequently,
the variance of X is

$$\mathrm{var}(X) = \int x^2 |f(x)|^2\,dx$$

and the variance of Y is

$$\mathrm{var}(Y) = \frac{1}{2\pi}\int \gamma^2 |F(\gamma)|^2\,d\gamma.$$
The variances measure how spread out the functions |f(x)|² and |F(γ)|²
are around their respective means. From Inequality (2.20) we know that
the product of these variances is not smaller than 1/4.
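The bound on the product of variances can be checked numerically without computing any Fourier transform, by using the identity from the proof that ∫γ²|F(γ)|²dγ / ∫|F(γ)|²dγ equals ∫|f'(x)|²dx / ∫|f(x)|²dx. The sketch below assumes real-valued, rapidly decaying f with both means equal to zero; the integration window, step count, finite-difference step, and the two test functions are illustrative choices.

```python
import math

def uncertainty_product(f, half_width=12.0, steps=24000):
    """(int x^2 f^2 / int f^2) * (int f'^2 / int f^2) for real f, by midpoint sums.

    The second factor equals the normalized second moment of |F|^2, by the
    Parseval-Plancherel identity used in the proof of Inequality (2.20).
    """
    dx = 2 * half_width / steps
    h = 1e-5                                  # step for the central difference
    norm = second_moment = deriv_energy = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx
        fx = f(x)
        dfx = (f(x + h) - f(x - h)) / (2 * h)
        norm += fx * fx * dx
        second_moment += x * x * fx * fx * dx
        deriv_energy += dfx * dfx * dx
    return (second_moment / norm) * (deriv_energy / norm)

alpha = 0.7   # arbitrary width

# A Gaussian attains the bound 1/4 ...
gaussian_product = uncertainty_product(lambda x: math.exp(-(alpha * x) ** 2))

# ... while another smooth function (here x * exp(-x^2/2)) exceeds it.
other_product = uncertainty_product(lambda x: x * math.exp(-x * x / 2))
```

The Gaussian gives 0.25 to within numerical error, and the second function gives 2.25, consistent with the Gaussians being the extremal case.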


Ex. 2.16 Show, by examining the proof of Inequality (2.20), that if the
inequality is an equation for some f then f'(x) = kxf(x), so that f(x) =
e^{-α²x²} for some α > 0. Hint: What can be said when Cauchy’s Inequality
is an equality?


2.15 Best Approximation 


The basic problem here is to estimate F(γ) from finitely many values of
f(x), under the assumption that F(γ) = 0 for |γ| > Γ, for some Γ > 0. Since
we do not have all of f(x), the best we can hope to do is to approximate
F(γ) in some sense. To help us understand how best approximation works,
we consider the orthogonality principle.


2.15.1 The Orthogonality Principle 


Imagine that you are standing and looking down at the floor. The point 
B on the floor that is closest to the tip of your nose, which we label F, 
is the unique point on the floor such that the vector from B to any other 
point A on the floor is perpendicular to the vector from B to F; that is, 
FB · AB = 0. This is a simple illustration of the orthogonality principle.

When two vectors are perpendicular to one another, their dot product
is zero. This idea can be extended to functions. We say that two functions
F(γ) and G(γ) defined on the interval [-Γ, Γ] are orthogonal if

$$\int_{-\Gamma}^{\Gamma} F(\gamma)\overline{G(\gamma)}\,d\gamma = 0.$$

Suppose that G_n(γ), n = 0, ..., N - 1, are known functions, and

$$A(\gamma) = \sum_{n=0}^{N-1} a_n G_n(\gamma),$$

for any coefficients a_n. We want to minimize the approximation error

$$\int_{-\Gamma}^{\Gamma} |F(\gamma) - A(\gamma)|^2\,d\gamma,$$

over all coefficients a_n. Suppose that the best choices are a_n = b_n. The
orthogonality principle tells us that the best approximation

$$B(\gamma) = \sum_{n=0}^{N-1} b_n G_n(\gamma)$$

is such that the function F(γ) - B(γ) is orthogonal to A(γ) - B(γ) for
every choice of the a_n.

Suppose that we fix m and select a_n = b_n, for n ≠ m, and a_m = b_m + 1.
Then we have

$$\int_{-\Gamma}^{\Gamma} \big(F(\gamma) - B(\gamma)\big)\overline{G_m(\gamma)}\,d\gamma = 0. \tag{2.21}$$

We can use Equation (2.21) to help us find the best b_n.
From Equation (2.21) we have

$$\int_{-\Gamma}^{\Gamma} F(\gamma)\overline{G_m(\gamma)}\,d\gamma = \sum_{n=0}^{N-1} b_n \int_{-\Gamma}^{\Gamma} G_n(\gamma)\overline{G_m(\gamma)}\,d\gamma.$$

Since we know the G_n(γ), we know the integrals

$$\int_{-\Gamma}^{\Gamma} G_n(\gamma)\overline{G_m(\gamma)}\,d\gamma.$$

If we can learn the values

$$\int_{-\Gamma}^{\Gamma} F(\gamma)\overline{G_m(\gamma)}\,d\gamma$$

from measurements, then we simply solve a system of linear equations to
find the b_n.
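The linear system described above can be set up and solved for a small real-valued example. In this sketch the target F(γ) = |γ| on [-1, 1] and the basis functions 1, γ, γ² are hypothetical choices, the integrals are approximated with a midpoint rule, and the solver is plain Gaussian elimination. Complex conjugates are omitted because everything here is real.

```python
def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * dx) for k in range(steps)) * dx

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

Gamma = 1.0
target = abs                                              # hypothetical F(gamma) = |gamma|
basis = [lambda g: 1.0, lambda g: g, lambda g: g * g]     # hypothetical G_0, G_1, G_2

# Row m, column n holds the integral of G_n * G_m; the right side holds F * G_m.
gram = [[integrate(lambda g, p=p, q=q: p(g) * q(g), -Gamma, Gamma) for p in basis]
        for q in basis]
rhs = [integrate(lambda g, q=q: target(g) * q(g), -Gamma, Gamma) for q in basis]
b = solve(gram, rhs)

def best(g):
    """The best approximation B(gamma) = sum_n b_n G_n(gamma)."""
    return sum(b_n * G_n(g) for b_n, G_n in zip(b, basis))

# Orthogonality check: F - B should be orthogonal to every G_m.
residual_products = [integrate(lambda g, q=q: (target(g) - best(g)) * q(g), -Gamma, Gamma)
                     for q in basis]
```

For this example the solution is b ≈ (3/16, 0, 15/16), and the residual F - B is orthogonal to each basis function up to quadrature error, as the orthogonality principle requires.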


2.15.2 An Example 


Suppose that we have measured the values f(x_n), for n = 0, ..., N - 1,
where the x_n are arbitrary real numbers. Then, from these measurements,
we can find the best approximation of F(γ) of the form

$$A(\gamma) = \sum_{n=0}^{N-1} a_n G_n(\gamma),$$

if we select G_n(γ) = e^{iγx_n}.
2.15.3 The DFT as Best Approximation


Suppose now that our data values are f(Δn), for n = 0, 1, ..., N - 1,
where we have chosen Δ = π/Γ. We can view the DFT as a best approximation
of the function F(γ) over the interval [-Γ, Γ], in the following sense.
Consider all functions of the form

$$A(\gamma) = \sum_{n=0}^{N-1} a_n e^{in\Delta\gamma},$$

where the best coefficients a_n = b_n are to be determined. Now select those
b_n for which the approximation error

$$\int_{-\Gamma}^{\Gamma} |F(\gamma) - A(\gamma)|^2\,d\gamma$$

is minimized. Then it is easily shown that these optimal b_n are precisely

$$b_n = \Delta f(n\Delta),$$

for n = 0, 1, ..., N - 1.


Ex. 2.17 Show that b_n = Δf(nΔ), for n = 0, 1, ..., N - 1, are the optimal
coefficients.


The DFT estimate is reasonably accurate when N is large, but when
N is not large there are usually better ways to estimate F(γ), as we shall
see.

In Figure 2.1, the real-valued function f(x) is the solid-line figure in both 
graphs. In the bottom graph, we see the true f(x) and a DFT estimate. The 
top graph is the MDFT estimator, the result of band-limited extrapolation, 
a technique for predicting missing Fourier coefficients that we shall discuss 
next. 


2.15.4 The Modified DFT (MDFT) 


We suppose, as in the previous subsection, that F(γ) = 0, for |γ| > Γ,
and that our data values are f(nΔ), for n = 0, 1, ..., N - 1. It is often
convenient to use a sampling interval Δ that is smaller than π/Γ in order
to obtain more data values. Therefore, we assume now that Δ < π/Γ. Once
again, we seek the function of the form

$$A(\gamma) = \sum_{n=0}^{N-1} a_n e^{in\Delta\gamma},$$

defined for |γ| ≤ Γ, for which the error measurement

$$\int_{-\Gamma}^{\Gamma} |F(\gamma) - A(\gamma)|^2\,d\gamma$$

is minimized.

FIGURE 2.1: The non-iterative band-limited extrapolation method
(MDFT) (top) and the DFT (bottom) for N = 64, 30 times over-sampled
data.
In the previous example, for which Δ = π/Γ, we have

$$\int_{-\Gamma}^{\Gamma} e^{i(n-m)\Delta\gamma}\,d\gamma = 0,$$

for m ≠ n. As the reader will discover in doing Exercise 2.17, this greatly
simplifies the system of linear equations that we need to solve to get the
optimal b_n. Now, because Δ ≠ π/Γ, we have

$$\frac{1}{2\pi}\int_{-\Gamma}^{\Gamma} e^{i(n-m)\Delta\gamma}\,d\gamma = \frac{\sin((n-m)\Delta\Gamma)}{\pi (n-m)\Delta},$$

which is not zero when n ≠ m. This means that we have to solve a more
complicated system of linear equations in order to find the b_n. It is
important to note that the optimal b_n are not equal to Δf(nΔ) now, so the
DFT is not the optimal approximation. The best approximation in this case
we call the modified DFT (MDFT), given by

$$F_{MDFT}(\gamma) = \chi_\Gamma(\gamma) \sum_{n=0}^{N-1} b_n e^{in\Delta\gamma}, \tag{2.22}$$

where χ_Γ(γ) is the function that is one for |γ| ≤ Γ and zero otherwise.
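A small numerical experiment shows how the MDFT coefficients differ from the DFT coefficients when the data are over-sampled. This is a sketch under stated assumptions: the true transform is taken to be F(γ) = 1 on [-Γ, Γ], so that f(x) = sin(Γx)/(πx); the b_n are obtained from the data-consistency equations f(mΔ) = Σ_n b_n sin((n-m)ΔΓ)/(π(n-m)Δ); and N = 6, Γ = 1, and a two-fold over-sampling are arbitrary choices. The helper names are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def sinc_system(N, delta, Gamma):
    """Entries (1/(2 pi)) * integral over [-Gamma, Gamma] of e^{i(n-m) delta gamma}."""
    rows = []
    for m in range(N):
        row = []
        for n in range(N):
            d = (n - m) * delta
            row.append(Gamma / math.pi if d == 0 else math.sin(d * Gamma) / (math.pi * d))
        rows.append(row)
    return rows

def band_limited_f(x, Gamma):
    """Inverse FT of F = 1 on [-Gamma, Gamma]: sin(Gamma x)/(pi x)."""
    return Gamma / math.pi if x == 0 else math.sin(Gamma * x) / (math.pi * x)

Gamma, N = 1.0, 6
nyquist = math.pi / Gamma

# Over-sampled by a factor of two: MDFT and DFT coefficients differ.
delta = nyquist / 2
data = [band_limited_f(n * delta, Gamma) for n in range(N)]
b_mdft = solve(sinc_system(N, delta, Gamma), data)
b_dft = [delta * d for d in data]

# At the Nyquist spacing the system is diagonal and the two coincide.
data_nyq = [band_limited_f(n * nyquist, Gamma) for n in range(N)]
b_nyq = solve(sinc_system(N, nyquist, Gamma), data_nyq)
```

In the over-sampled case the MDFT coefficients come out as (1, 0, 0, 0, 0, 0) up to rounding, which reproduces F(γ) = 1 on the band exactly, while the DFT coefficients Δf(nΔ) start with 0.5 and 1/π. For larger N or heavier over-sampling the sinc matrix becomes badly conditioned, so a plain solver like this one is only suitable for small illustrations.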


2.15.5 The PDFT 


In the previous subsection, the functions A(γ) were defined for |γ| ≤ Γ.
Therefore, we could have written them as

$$A(\gamma) = \chi_\Gamma(\gamma) \sum_{n=0}^{N-1} a_n e^{in\Delta\gamma}.$$

The factor χ_Γ(γ) serves to incorporate into our approximating function our
prior knowledge that F(γ) = 0 outside the interval [-Γ, Γ]. What can we
do if we have additional prior knowledge about the broad features of F(γ)
that we wish to include?

Suppose that P(γ) ≥ 0 is a prior estimate of |F(γ)|. Now we approximate
F(γ) with functions of the form

$$C(\gamma) = P(\gamma) \sum_{n=0}^{N-1} c_n e^{in\Delta\gamma}.$$

As we shall see in Chapter 25, the best choices of the c_n are the ones that
satisfy the equations

$$f(m\Delta) = \sum_{n=0}^{N-1} c_n\, p((m-n)\Delta), \tag{2.23}$$

for m = 0, 1, ..., N - 1, where

$$p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} P(\gamma)\, e^{-ix\gamma}\,d\gamma$$

is the inverse Fourier transform of the function P(γ). This best
approximation we call the PDFT [23, 24, 26]. The use of the PDFT was
illustrated in Chapter 1, in the reconstruction of a simulated head slice.
2.16 Analysis of the MDFT


Let our data be f(x_m), m = 1, ..., M, where the x_m are arbitrary values
of the variable x. If F(γ) is zero outside [-Γ, Γ], then minimizing the energy
over [-Γ, Γ] subject to data consistency produces an estimate of the form

$$F_{MDFT}(\gamma) = \chi_\Gamma(\gamma) \sum_{m=1}^{M} b_m \exp(i x_m \gamma),$$

with the b_m satisfying the equations

$$f(x_n) = \sum_{m=1}^{M} b_m \frac{\sin(\Gamma(x_m - x_n))}{\pi (x_m - x_n)},$$

for n = 1, ..., M. The matrix S_Γ with entries

$$\frac{\sin(\Gamma(x_m - x_n))}{\pi (x_m - x_n)}$$

we call a sinc matrix.


2.16.1 Eigenvector Analysis of the MDFT


Although it seems reasonable that incorporating the additional
information about the support of F(γ) should improve the estimation, it
would be more convincing if we had a more mathematical argument to make.
For that we turn to an analysis of the eigenvectors of the sinc matrix.
Throughout this subsection we make the simplification that x_n = n.


Ex. 2.18 The purpose of this exercise is to show that, for an Hermitian
nonnegative-definite M by M matrix Q, a norm-one eigenvector u^1 of Q
associated with its largest eigenvalue, λ_1, maximizes the quadratic form
a†Qa over all vectors a with norm one. Let Q = ULU† be the eigenvector
decomposition of Q, where the columns of U are mutually orthogonal
eigenvectors u^n with norms equal to one, so that U†U = I, and
L = diag{λ_1, ..., λ_M} is the diagonal matrix with the eigenvalues of Q as
its entries along the main diagonal. Assume that λ_1 ≥ λ_2 ≥ ... ≥ λ_M.
Then maximize

$$a^\dagger Q a = \sum_{n=1}^{M} \lambda_n |a^\dagger u^n|^2,$$

subject to the constraint

$$a^\dagger a = a^\dagger U^\dagger U a = \sum_{n=1}^{M} |a^\dagger u^n|^2 = 1.$$

Hint: Show a†Qa is a convex combination of the eigenvalues of Q.
Ex. 2.19 Show that, for the sinc matrix Q = S_Γ, the quadratic form a†Qa
in the previous exercise becomes

$$a^\dagger S_\Gamma a = \frac{1}{2\pi}\int_{-\Gamma}^{\Gamma} \Big| \sum_{n=1}^{M} a_n e^{in\gamma} \Big|^2\,d\gamma.$$

Show that the norm of the vector a is the square root of the integral

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \Big| \sum_{n=1}^{M} a_n e^{in\gamma} \Big|^2\,d\gamma.$$

Ex. 2.20 For M = 30 compute the eigenvalues of the matrix S_Γ for various
choices of Γ, such as Γ = π/k, for k = 2, 3, ..., 10. For each k arrange the
set of eigenvalues in decreasing order and note the proportion of them that
are not near zero. The set of eigenvalues of a matrix is sometimes called
its eigenspectrum and the nonnegative function χ_Γ(γ) is a power spectrum;
here is one time in which different notions of a spectrum are related.


2.16.2 The Eigenfunctions of S_Γ


Suppose that the vector u^1 = (u^1_1, ..., u^1_M)^T is an eigenvector of S_Γ
corresponding to the largest eigenvalue, λ_1. Associate with u^1 the
eigenfunction

$$U^1(\gamma) = \sum_{n=1}^{M} u^1_n e^{in\gamma}.$$

Then

$$\lambda_1 = \int_{-\Gamma}^{\Gamma} |U^1(\gamma)|^2\,d\gamma \Big/ \int_{-\pi}^{\pi} |U^1(\gamma)|^2\,d\gamma$$

and U^1(γ) is the function of its form that is most concentrated within the
interval [-Γ, Γ].

Similarly, if u^M is an eigenvector of S_Γ associated with the smallest
eigenvalue λ_M, then the corresponding eigenfunction U^M(γ) is the function
of its form least concentrated in the interval [-Γ, Γ].


Ex. 2.21 On the interval |γ| ≤ π plot the functions |U^m(γ)| corresponding
to each of the eigenvectors of the sinc matrix S_Γ. Pay particular attention
to the places where each of these functions is zero.
The eigenvectors of S_Γ corresponding to different eigenvalues are
orthogonal, that is (u^m)†u^n = 0 if m is not n. We can write this in terms
of integrals:

$$\int_{-\pi}^{\pi} U^n(\gamma)\overline{U^m(\gamma)}\,d\gamma = 0$$

if m is not n. The mutual orthogonality of these eigenfunctions is related
to the locations of their roots, which were studied in the previous exercise.

Any Hermitian matrix Q is invertible if and only if none of its
eigenvalues is zero. With λ_m and u^m, m = 1, ..., M, the eigenvalues and
eigenvectors of Q, the inverse of Q can then be written as

$$Q^{-1} = (1/\lambda_1)\, u^1 (u^1)^\dagger + \cdots + (1/\lambda_M)\, u^M (u^M)^\dagger.$$


Ex. 2.22 Show that the MDFT estimator F_MDFT(γ) given by Equation (2.22)
can be written as

$$F_{MDFT}(\gamma) = \chi_\Gamma(\gamma) \sum_{m=1}^{M} \frac{1}{\lambda_m} (u^m)^\dagger d\; U^m(\gamma),$$

where d = (f(1), f(2), ..., f(M))^T is the data vector.


Ex. 2.23 Show that the DFT estimate of F(γ), restricted to the interval
[-Γ, Γ], is

$$F_{DFT}(\gamma) = \chi_\Gamma(\gamma) \sum_{m=1}^{M} (u^m)^\dagger d\; U^m(\gamma).$$

Hint: Use the fact that I = UU†.


From these two exercises we can learn why it is that the estimate F_MDFT(γ)
resolves better than the DFT. The former makes more use of the
eigenfunctions U^m(γ) for higher values of m, since these are the ones for
which λ_m is closer to zero. Since those eigenfunctions are the ones having
most of their roots within the interval [-Γ, Γ], they have the most
flexibility within that region and are better able to describe those
features in F(γ) that are not resolved by the DFT.
3.1 Chapter Summary


A basic problem in remote sensing is to determine the nature of a dis- 
tant object by measuring signals transmitted by or reflected from that 
object. If the object of interest is sufficiently remote, that is, is in the far 
field, the data we obtain by sampling the propagating spatio-temporal field 
is related, approximately, to what we want by Fourier transformation. In 
this chapter we present examples to illustrate the roles played by Fourier 
series and Fourier coefficients in the analysis of remote sensing and signal 
transmission. We use these examples to motivate several of the computa- 
tional problems we shall consider in detail later in the text. We also discuss 
two inverse problems involving the Laplace transform. 

We consider here a common problem of remote sensing of transmitted or 
reflected waves propagating from distant sources. Examples include optical 
imaging of planets and asteroids using reflected sunlight, radio-astronomy 
imaging of distant sources of radio waves, active and passive sonar, radar 
imaging using microwaves, and infrared (IR) imaging to monitor the ocean 
temperature. In such situations, as well as in transmission and emission 
tomography and magnetic-resonance imaging, what we measure are es- 
sentially the Fourier coefficients or values of the Fourier transform of the 
function we want to estimate. The image reconstruction problem then be- 
comes one of estimating a function from finitely many noisy values of its 
Fourier transform. 


3.2 Fourier Series and Fourier Coefficients 


We suppose that f : [-L, L] → C, and that its Fourier series converges
to f(x) for all x in [-L, L]. In the examples in this chapter, we shall see
how Fourier coefficients can arise as data obtained through measurements.
However, we shall be able to measure only a finite number of the Fourier
coefficients. One issue that will concern us is the effect on the estimation
of f(x) if we use some, but not all, of its Fourier coefficients.

Suppose that we have c_n, as defined by Equation (2.5), for n =
0, 1, 2, ..., N. It is not unreasonable to try to estimate the function f(x)
using the discrete Fourier transform (DFT) estimate, which is

$$f_{DFT}(x) = \sum_{n=0}^{N} c_n e^{i n \frac{\pi}{L} x}.$$

When we know that f(x) is real-valued, and so c_{-n} is the complex
conjugate of c_n, we naturally assume that we have the values of c_n for
|n| ≤ N.


3.3 The Unknown Strength Problem 


In this example, we imagine that each point x in the interval [-L, L]
is sending out a signal that is a complex-exponential-function signal, also
called a sinusoid, at the frequency ω, each with its own strength f(x); that
is, the signal sent by the point x is

$$f(x) e^{i\omega t}.$$


In our first example, we imagine that the strength function f(x) is unknown 
and we want to determine it. It could be the case that the signals originate 
at the points x, as with light or radio waves from the sun, or are simply 
reflected from the points x, as is sunlight from the moon or radio waves 
in radar. Later in this chapter, we shall investigate a related example, in 
which the points x transmit known signals and we want to determine what 
is received elsewhere. 


3.3.1 Measurement in the Far Field 


Now let us consider what is received by a point P on the circumference
of a circle centered at the origin and having large radius D. The point P
corresponds to the angle θ as shown in Figure 3.1; we use θ in the interval
[0, π]. It takes a finite time for the signal sent from x at time t to reach P,
so there is a delay.

We assume that c is the speed at which the signal propagates. Because
D is large relative to L, we make the far-field assumption, which allows us
to approximate the distance from x to P by D - x cos θ. Therefore, what
P receives at time t from x is approximately what was sent from x at time
t - (1/c)(D - x cos θ).


Ex. 3.1 Show that, for any point P on the circle of radius D and any
x ≠ 0, the distance from x to P is always greater than or equal to the
far-field approximation D - x cos θ, with equality if and only if θ = 0 or
θ = π.
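The far-field approximation can be checked numerically. The sketch below assumes the geometry suggested by the text, with x on the horizontal axis and P = (D cos θ, D sin θ); the specific values of D and x are hypothetical. It is a sanity check of the inequality in Ex. 3.1, not a proof.

```python
import math

def exact_distance(x, D, theta):
    """Distance from the point (x, 0) to P = (D cos(theta), D sin(theta))."""
    return math.hypot(D * math.cos(theta) - x, D * math.sin(theta))

def far_field_distance(x, D, theta):
    """Far-field approximation D - x cos(theta)."""
    return D - x * math.cos(theta)

D, x = 1000.0, 1.0     # illustrative values with D much larger than |x|
thetas = [k * math.pi / 8 for k in range(9)]    # angles from 0 to pi
gaps = [exact_distance(x, D, t) - far_field_distance(x, D, t) for t in thetas]
```

The entries of `gaps` are nonnegative (up to rounding), vanish at θ = 0 and θ = π, and are largest near θ = π/2, where the gap is roughly x²/(2D).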


At time t, the point P receives from x the signal

$$f(x) e^{i\omega\left(t - \frac{1}{c}(D - x\cos\theta)\right)} = e^{i\omega(t - D/c)} f(x) e^{i\omega x\cos\theta/c}.$$

FIGURE 3.1: Far-field measurements.

Because the point P receives signals from all x in [-L, L], the signal that
P receives at time t is

$$e^{i\omega(t - D/c)} \int_{-L}^{L} f(x) e^{i\omega x\cos\theta/c}\,dx.$$

Therefore, from measurements in the far field, we obtain the values

$$\int_{-L}^{L} f(x) e^{i\omega x\cos\theta/c}\,dx.$$

When θ is chosen so that

$$\frac{\omega\cos\theta}{c} = -\frac{n\pi}{L}, \qquad (3.1)$$

we have c_n.


3.3.2 Limited Data 


Note that we will be able to solve Equation (3.1) for θ if and only if we
have

$$|n| \le \frac{L\omega}{\pi c}.$$

This tells us that we can measure only finitely many of the Fourier
coefficients of f(x). It is common in signal processing to speak of the
wavelength of a sinusoidal signal; the wavelength associated with a given
ω and c is

$$\lambda = \frac{2\pi c}{\omega}.$$

Therefore we can measure 2N + 1 Fourier coefficients, where N is the largest
integer not greater than 2L/λ, which is the length of the interval [-L, L],
measured in units of wavelength λ. We get more Fourier coefficients when
the product Lω is larger; this means that when L is small, we want ω to be
large, so that λ is small and N is large. As we saw previously, using these
finitely many Fourier coefficients to calculate the DFT reconstruction of
f(x) can lead to a poor estimate of f(x), particularly when N is small.
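The count of measurable coefficients follows directly from the bound |n| ≤ Lω/(πc) = 2L/λ. The sketch below computes N and the measurement angles that solve Equation (3.1); the numerical values of c, ω and L are hypothetical and chosen only for illustration.

```python
import math

def measurable_orders(L, omega, c):
    """Return (N, wavelength), where N is the largest integer not greater
    than L*omega/(pi*c) = 2L/wavelength."""
    wavelength = 2.0 * math.pi * c / omega
    N = math.floor(2.0 * L / wavelength)
    return N, wavelength

def measurement_angle(n, L, omega, c):
    """Angle theta in [0, pi] solving omega*cos(theta)/c = -n*pi/L."""
    return math.acos(-n * math.pi * c / (L * omega))

c = 1500.0                       # hypothetical propagation speed (m/s)
omega = 2.0 * math.pi * 500.0    # hypothetical angular frequency (500 Hz)
L = 10.0                         # hypothetical half-length of the interval (m)
N, wavelength = measurable_orders(L, omega, c)
angles = [measurement_angle(n, L, omega, c) for n in range(-N, N + 1)]
```

Here the wavelength is 3 meters, so the interval [-L, L] is 20/3 wavelengths long and only the 13 coefficients with |n| ≤ 6 correspond to real angles.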

Consider the situation in which the points x are reflecting signals that 
are sent to probe the structure of an object described by the function f, 
as in radar. This relationship between the number Lω and the number of
Fourier coefficients we can measure amounts to a connection between the 
frequency of the probing signal and the resolution attainable; finer detail 
is available only if the frequency is high enough. 

The wavelengths used in primitive early radar at the start of World War 
II were several meters long. Since resolution is proportional to aperture, 
that is, the length of the array measured in units of wavelength, antennas 
for such radar needed to be quite large. As Körner notes in [102], the general
feeling at the time was that the side with the shortest wavelength would 
win the war. The cavity magnetron, invented during the war by British 
scientists, made possible microwave radar having a wavelength of 10 cm, 
which could then be mounted easily on planes. 


3.3.3 Can We Get More Data? 


As we just saw, we can make measurements at any points P in the 
far field; perhaps we do not need to limit ourselves to just those angles 
that lead to the c_n. It may come as somewhat of a surprise, but from the
theory of complex analytic functions we can prove that there is enough 
data available to us here to reconstruct f(x) perfectly, at least in principle. 
The drawback, in practice, is that the measurements would have to be free 
of noise and impossibly accurate. All is not lost, however. 


3.3.4 Measuring the Fourier Transform 


If θ is chosen so that

$$\frac{\omega\cos\theta}{c} = -\frac{n\pi}{L},$$

then our measurement gives us the Fourier coefficients c_n. But we can
select any angle θ and use any P we want. In other words, we can obtain
the values

$$\int_{-L}^{L} f(x) e^{i\omega x\cos\theta/c}\,dx,$$

for any angle θ. With the change of variable

$$\gamma = \frac{\omega\cos\theta}{c},$$

we can obtain the value of the Fourier transform,

$$F(\gamma) = \int_{-L}^{L} f(x) e^{i\gamma x}\,dx,$$

for any γ in the interval [-ω/c, ω/c].

We are free to measure at any P and therefore to obtain values of F(γ)
for any value of γ in the interval [-ω/c, ω/c]. We need to be careful how we
process the resulting data, however.


3.3.5 Over-Sampling 


Suppose, for the sake of illustration, that we measure the far-field signals
at points P corresponding to angles θ that satisfy

$$\frac{\omega\cos\theta}{c} = -\frac{n\pi}{2L},$$

instead of

$$\frac{\omega\cos\theta}{c} = -\frac{n\pi}{L}.$$

Now we have twice as many data points and from these new measurements
we can obtain

$$d_n = \int_{-L}^{L} f(x) e^{-i n\pi x/(2L)}\,dx,$$

for |n| ≤ 2N. We say now that our data is twice over-sampled. Note that
we call it over-sampled because the rate at which we are sampling is higher,
even though the distance between samples is shorter. The values d_n are not
simply more of the Fourier coefficients of f. The question now is: What are
we to do with these extra data values?

The values d_n are, in fact, Fourier coefficients, but not of f; they are
Fourier coefficients of the function g : [-2L, 2L] → C, where g(x) = f(x)
for |x| ≤ L, and g(x) = 0, otherwise. If we simply use the d_n as Fourier
coefficients of the function g(x) and compute the resulting DFT estimate
of g(x),

$$g_{DFT}(x) = \sum_{n=-2N}^{2N} d_n e^{i n\pi x/(2L)},$$

this function estimates f(x) for |x| ≤ L, but it also estimates g(x) = 0 for
the other values of x in [-2L, 2L]. When we graph g_DFT(x) for |x| ≤ L
we find that we have no improvement over what we got with the previous
estimate f_DFT. The problem is that we have wasted the extra data by
estimating g(x) = 0 where we already knew that it was zero. To make
good use of the extra data we need to incorporate this prior information
about the function g. The MDFT and PDFT algorithms provide estimates
of f(x) that incorporate prior information.


3.3.6 The Modified DFT 


The modified DFT (MDFT) estimate was first presented in [22]. For
our example of twice over-sampled data, the MDFT is defined for |x| ≤ L
and has the algebraic form

$$f_{MDFT}(x) = \sum_{n=-2N}^{2N} a_n e^{i n\pi x/(2L)}, \qquad (3.2)$$

for |x| ≤ L. The coefficients a_n are not the d_n. The a_n are determined by
requiring that the function f_MDFT be consistent with the measured data,
the d_n. In other words, we must have

$$d_n = \int_{-L}^{L} f_{MDFT}(x) e^{-i n\pi x/(2L)}\,dx. \qquad (3.3)$$

When we insert f_MDFT(x) as given in Equation (3.2) into Equation (3.3)
we get a system of 4N + 1 linear equations in 4N + 1 unknowns, the a_n,
one for each n with |n| ≤ 2N. We then solve this system for the a_n and use
them in Equation (3.2). Figure 2.1 shows the improvement we can achieve
using the MDFT. The data used to construct the graphs in that figure was
thirty times over-sampled. We note here that, had we extended f initially
as a 2L-periodic function, it would be difficult to imagine the function g(x)
and we would have a hard time figuring out what to do with the d_n.

In this example we measured twice as much data as previously. We 
can, of course, measure even more data, and it need not correspond to the 
Fourier coefficients of any function. The potential drawback is that, as we 
use more data, the system of linear equations that we must solve to obtain 
the MDFT estimate becomes increasingly sensitive to noise and round-off 
error in the data. It is possible to lessen this effect by regularization, but
not to eliminate it entirely. Regularization can be introduced here simply
by multiplying by, say, 1.01, the entries of the main diagonal of the matrix 
of the linear system. This makes the matrix less ill-conditioned. 
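The linear system behind the MDFT can be sketched in a few lines of Python. Inserting Equation (3.2) into Equation (3.3) gives a matrix whose (n, m) entry is the integral of e^{i(m-n)πx/(2L)} over [-L, L], which equals 2L when m = n and 4L sin(kπ/2)/(kπ) with k = m - n otherwise. The function names, the small Gaussian-elimination solver, and the test data below are illustrative assumptions, not the author's implementation.

```python
import math

def mdft_matrix(N, L, reg=1.0):
    """Entries: integral over [-L, L] of e^{i(m-n) pi x/(2L)} dx, |m|, |n| <= 2N.
    The main diagonal is multiplied by reg (e.g. 1.01) as a simple regularization."""
    idx = range(-2 * N, 2 * N + 1)
    M = []
    for n in idx:
        row = []
        for m in idx:
            k = m - n
            if k == 0:
                row.append(2.0 * L * reg)
            else:
                row.append(4.0 * L * math.sin(k * math.pi / 2.0) / (k * math.pi))
        M.append(row)
    return M

def solve(M, d):
    """Gaussian elimination with partial pivoting; the data d may be complex."""
    size = len(d)
    A = [list(M[i]) + [d[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, size):
            factor = A[r][col] / A[col][col]
            for j in range(col, size + 1):
                A[r][j] -= factor * A[col][j]
    a = [0.0] * size
    for i in reversed(range(size)):
        s = sum(A[i][j] * a[j] for j in range(i + 1, size))
        a[i] = (A[i][size] - s) / A[i][i]
    return a

# Hypothetical check: data generated by f(x) = 1 on [-L, L], so the d_n
# coincide with the column of the unregularized matrix belonging to m = 0.
N, L = 1, 1.0
M = mdft_matrix(N, L)
d = [row[2 * N] for row in M]
a = solve(M, d)                                   # a_0 near 1, other a_n near 0
a_reg = solve(mdft_matrix(N, L, reg=1.01), d)     # regularized variant
```

With noise-free data the unregularized solve recovers f exactly in this small case; the regularized coefficients differ slightly, which is the price paid for a better-conditioned matrix.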

In our example, we used the prior knowledge that f(x) = 0 for |x| > L. 
Now, we shall describe in detail the use of other forms of prior knowledge 
about f(x) to obtain reconstructions that are better than the DFT. 


3.3.7 Other Forms of Prior Knowledge 


As we just showed, knowing that we have over-sampled in our measure- 
ments can help us improve the resolution in our estimate of f(x). We may 
have other forms of prior knowledge about f(x) that we can use. If we know 
something about large-scale features of f(x), but not about finer details, 
we can use the PDFT estimate, which is a generalization of the MDFT. 
In Chapter 1 the PDFT was compared to the DFT in a two-dimensional 
example of simulated head slices. 

The MDFT estimator can be written as

$$f_{MDFT}(x) = \chi_L(x) \sum_{n=-2N}^{2N} a_n e^{i n\pi x/(2L)}.$$

We include the prior information that f(x) is supported on the interval
[-L, L] through the factor χ_L(x), the characteristic function of that
interval. If we select a function p(x) ≥ 0 that describes our prior estimate
of the shape of |f(x)|, we can then estimate f(x) using the PDFT estimator,
which, in this case of twice over-sampled data, takes the form

$$f_{PDFT}(x) = p(x) \sum_{n=-2N}^{2N} b_n e^{i n\pi x/(2L)}.$$

As with the MDFT estimator, we determine the coefficients b_n by requiring
that f_PDFT(x) be consistent with the measured data.

There are other things we may know about f(x). We may know that 
f(x) is nonnegative, or we may know that f(x) is approximately zero for 
most x, but contains very sharp peaks at a few places. In more formal 
language, we may be willing to assume that f(x) contains a few Dirac delta 
functions in a flat background. There are nonlinear methods, such as the 
maximum entropy method, the indirect PDFT (IPDFT), and eigenvector 
methods, that can be used to advantage in such cases; these methods are 
often called high-resolution methods.


3.4 Generalizing the MDFT and PDFT


In our discussion so far the data we have obtained are values of the 
Fourier transform of the support-limited function f(x). The MDFT and 
PDFT can be extended to handle those cases in which the data we have 
are more general linear-functional values pertaining to f(x). 

Suppose that our data values are finitely many linear-functional values,

$$d_n = \int_{-L}^{L} f(x) \overline{g_n(x)}\,dx,$$

for n = 1, ..., N, where the g_n(x) are known functions. The extended MDFT
estimate of f(x) is

$$f_{MDFT}(x) = \sum_{m=1}^{N} a_m g_m(x),$$

where the coefficients a_m are chosen so that f_MDFT is consistent with the
measured data; that is,

$$d_n = \int_{-L}^{L} f_{MDFT}(x) \overline{g_n(x)}\,dx,$$

for each n. To find the a_m we need to solve a system of N equations in N
unknowns.

The PDFT can be extended in a similar way. The extended PDFT
estimate of f(x) is

$$f_{PDFT}(x) = p(x) \sum_{m=1}^{N} b_m g_m(x),$$

where, as previously, the coefficients b_m are chosen by forcing the estimate
of f(x) to be consistent with the measured data. Again, we need to solve
a system of N equations in N unknowns to find the coefficients.

For large values of N, setting up and solving the required systems of
linear equations can involve considerable effort. If we discretize the
functions f(x) and g_n(x), we can obtain good approximations of the extended
MDFT and PDFT using the iterative ART algorithm [142, 143].


3.5 One-Dimensional Arrays


In this section we consider the reversed situation in which the sources 
of the signals are the points on the circumference of the large circle and we 
are measuring the received signals at points of the x-axis. The objective is 
to determine the relative strengths of the signals coming to us from various 
angles. 

People with sight in only one eye have a difficult time perceiving depth 
in their visual field, unless they move their heads. Having two functioning 
ears helps us determine the direction from which sound is coming; blind 
people, who are more than usually dependent on their hearing, often move 
their heads to get a better sense of where the source of sound is. Snakes 
who smell with their tongues often have forked tongues, the better to detect 
the direction of the sources of different smells. In certain remote-sensing 
situations the sensors respond equally to arrivals from all directions. One 
then obtains the needed directionality by using multiple sensors, laid out 
in some spatial configuration called the sensor array. The simplest config- 
uration is to have the sensors placed in a straight line, as in a sonar towed 
array. 

Now we imagine that the points P = P(θ) in the far field are the sources
of the signals and we are able to measure the transmissions received at
points x on the x-axis; we no longer assume that these points are confined
to the interval [-L, L]. The P corresponding to the angle θ sends f(θ)e^{iωt},
where the absolute value of f(θ) is the strength of the signal coming from
P. We allow f(θ) to be complex, so that it has both magnitude and phase,
which means that we do not assume that the signals from the different
angles are in phase with one another; that is, we do not assume that they
all begin at the same time.

In narrow-band passive sonar, for example, we may have hydrophone
sensors placed at various points x and our goal is to determine how much
acoustic energy at a specified frequency is coming from different directions.
There may be only a few directions contributing significant energy at the
frequency of interest, in which case f(θ) is nearly zero for all but a few
values of θ.


3.5.1 Measuring Fourier Coefficients 


At time t the point x on the x-axis receives from P = P(θ) what P sent
at time t - (D - x cos θ)/c; so, at time t, x receives from P

$$e^{i\omega(t - D/c)} f(\theta) e^{i\omega x\cos\theta/c}.$$

Since x receives signals from all the angles, what x receives at time t is

$$e^{i\omega(t - D/c)} \int_0^{\pi} f(\theta) e^{i\omega x\cos\theta/c}\,d\theta.$$

We limit the angle θ to the interval [0, π] because, in this sensing model,
we cannot distinguish receptions from θ and from 2π - θ.

To simplify notation, we shall introduce the variable u = cos θ. We then
have

$$\frac{du}{d\theta} = -\sin(\theta) = -\sqrt{1 - u^2},$$

so that

$$d\theta = -\frac{1}{\sqrt{1 - u^2}}\,du.$$

Now let g(u) be the function

$$g(u) = \frac{f(\arccos(u))}{\sqrt{1 - u^2}},$$

defined for u in the interval (-1, 1). Since

$$\int_0^{\pi} f(\theta) e^{i\omega x\cos\theta/c}\,d\theta = \int_{-1}^{1} g(u) e^{i\omega x u/c}\,du,$$

we find that, from our measurement at x, we obtain G(γ), the value of the
Fourier transform of g(u) at γ, for

$$\gamma = \frac{x\omega}{c}.$$

Since g(u) is limited to the interval (-1, 1), its Fourier coefficients are

$$a_n = \frac{1}{2} \int_{-1}^{1} g(u) e^{-i n\pi u}\,du.$$

Therefore, if we select x so that

$$\gamma = \frac{x\omega}{c} = -n\pi,$$

we have a_n. Consequently, we want to measure at the points x such that

$$x = -\frac{n\pi c}{\omega} = -n\frac{\lambda}{2} = -n\Delta, \qquad (3.4)$$

where λ = 2πc/ω is the wavelength and Δ = λ/2 is the Nyquist spacing.

A one-dimensional array consists of measuring devices placed along a
straight line (the x-axis here). Obviously, there must be some smallest
bounded interval, say [A, B], that contains all these measuring devices.
The aperture of the array is (B - A)/λ, the length of the interval [A, B], in
units of wavelength. As we just saw, the aperture is directly related to the
number of Fourier coefficients of the function g(u) that we are measuring,
and therefore, to the accuracy of the DFT reconstruction of g(u). This is
usually described by saying that aperture determines resolution. As we saw,
a one-dimensional array involves an inherent ambiguity, in that we cannot
distinguish a signal from the angle θ from one from the angle 2π - θ. In
practice a two-dimensional configuration of sensors is sometimes used to
eliminate this ambiguity.

In numerous applications, such as astronomy, it is more realistic to 
assume that the sources of the signals are on the surface of a large sphere, 
rather than on the circumference of a large circle. In such cases, a one- 
dimensional array of sensors does not provide sufficient information and 
two- or three-dimensional sensor configurations are used. 

The number of Fourier coefficients of g(u) that we can measure, and 
therefore the resolution of the resulting reconstruction of f(θ), is limited by
the aperture. One way to improve resolution is to make the array of sensors 
longer, which is more easily said than done. However, synthetic-aperture 
radar (SAR) effectively does this. The idea of SAR is to employ the array 
of sensors on a moving airplane. As the plane moves, it effectively creates a 
longer array of sensors, a virtual array if you will. The one drawback is that 
the sensors in this virtual array are not all present at the same time, as in 
a normal array. Consequently, the data must be modified to approximate 
what would have been received at other times. 

The far-field approximation tells us that, at time t, every point x
receives from P(π/2) the same signal

$$e^{i\omega(t - D/c)} f\left(\frac{\pi}{2}\right).$$

Since there is nothing special about the angle π/2, we can say that the signal
arriving from any angle θ, which originally spread out as concentric circles
of constant value, has flattened out to the extent that, by the time it reaches 
our line of sensors, it is essentially constant on straight lines. This suggests 
the plane-wave approximation for signals propagating in three-dimensional 
space. As we shall see in Chapter 24, these plane-wave approximations are 
solutions to the three-dimensional wave equation. Much of array processing 
is based on such models of far-field propagation. 

As in the examples discussed previously, we do have more measurements 
we can take, if we use values of x other than those described by Equation 
(3.4). The issue will be what to do with these over-sampled measurements.


3.5.2 Over-Sampling


One situation in which over-sampling arises naturally occurs in sonar
array processing. Suppose that an array of sensors has been built to operate
at a design frequency of ω_0, which means that we have placed sensors a
distance of Δ_0 apart in [A, B], where λ_0 is the wavelength corresponding
to the frequency ω_0 and Δ_0 = λ_0/2 is the Nyquist spacing for frequency
ω_0. For simplicity, we assume that the sensors are placed at points x that
satisfy the equation

$$x = -\frac{n\pi c}{\omega_0} = -n\frac{\lambda_0}{2} = -n\Delta_0,$$

for |n| ≤ N. Now suppose that we want to operate the sensing at another
frequency, say ω. The sensors cannot be moved, so we must make do with
sensors at the points x determined by the design frequency.

Consider, first, the case in which the second frequency ω is less than
the design frequency ω_0. Then its wavelength λ is larger than λ_0, and the
Nyquist spacing Δ = λ/2 for ω is larger than Δ_0. So we have over-sampled.

The measurements taken at the sensors provide us with the integrals

$$\int_{-1}^{1} g(u) e^{-i n\pi u/K}\,du,$$

where K = ω_0/ω > 1. These are Fourier coefficients of the function g(u),
viewed as defined on the interval [-K, K], which is larger than [-1, 1], and
taking the value zero outside [-1, 1]. If we then use the DFT estimate of
g(u), it will estimate g(u) for the values of u within [-1, 1], which is what
we want, as well as for the values of u outside [-1, 1], where we already
know g(u) to be zero. Once again, we can use the MDFT, the modified
DFT, to include the prior knowledge that g(u) = 0 for u outside [-1, 1] to
improve our reconstruction of g(u) and f(θ). In sonar, for the over-sampled
case, the interval [—1, 1] is called the visible region (although audible region 
seems more appropriate for sonar), since it contains all the values of u that 
can correspond to actual angles of plane-wave arrivals of acoustic energy. 
In practice, of course, the measured data may well contain components 
that are not plane-wave arrivals, such as localized noises near individual 
sensors, or near-field sounds, so our estimate of the function g(u) should 
be regularized to allow for these non-plane-wave components. 


3.5.3 Under-Sampling 


Now suppose that the frequency ω that we want to consider is greater
than the design frequency ω_0. This means that the spacing between the
sensors is too large; we have under-sampled. Once again, however, we cannot
move the sensors and must make do with what we have.

Now the measurements at the sensors provide us with the integrals

$$\int_{-1}^{1} g(u) e^{-i n\pi u/K}\,du,$$

where K = ω_0/ω < 1. These are Fourier coefficients of the function g(u),
viewed as defined on the interval [-K, K], which is smaller than [-1, 1],
and taking the value zero outside [-K, K]. Since g(u) is not necessarily
zero outside [-K, K], treating it as if it were zero there results in a type
of error known as aliasing, in which energy corresponding to angles whose
u lies outside [-K, K] is mistakenly assigned to values of u that lie within
[-K, K]. Aliasing is a common phenomenon; the strobe-light effect is
aliasing, as is the apparent backward motion of the wheels of stagecoaches in
cowboy movies. In the case of the strobe light, we are permitted to view 
the scene at times too far apart for us to sense continuous, smooth motion. 
In the case of the wagon wheels, the frames of the film capture instants of 
time too far apart for us to see the true rotation of the wheels. 
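Aliasing in the under-sampled array can be made concrete with a short sketch. With sensors fixed at x = -nΔ_0, a plane wave with direction cosine u contributes the phase factor e^{-inπu/K} at sensor n, so two values of u that differ by 2K produce identical data at every sensor. The values of K, N and u below are hypothetical.

```python
import cmath
import math

def sensor_phases(u, K, N):
    """Phase factors e^{-i n pi u / K} at sensors n = -N..N for a plane wave
    with direction cosine u = cos(theta)."""
    return [cmath.exp(-1j * n * math.pi * u / K) for n in range(-N, N + 1)]

K = 0.5                        # hypothetical ratio omega_0 / omega < 1
N = 8
u_true = 0.8                   # direction cosine outside [-K, K]
u_alias = u_true - 2.0 * K     # shifted by 2K, lands inside [-K, K]
diff = max(abs(p - q) for p, q in
           zip(sensor_phases(u_true, K, N), sensor_phases(u_alias, K, N)))
```

The quantity `diff` is zero up to rounding: the array cannot tell the arrival at u = 0.8 from one at u = -0.2, which is exactly the mistaken assignment of energy described above.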


3.6 Resolution Limitations 


As we have seen, in the unknown-strength problem the number of
Fourier coefficients we can measure is limited by the ratio L/λ. Additional
measurements in the far field can provide additional information about the
function f(x), but extracting that information becomes an increasingly
ill-conditioned problem, one more sensitive to noise the more data we gather.

In the line-array problem just considered, there is, in principle, no limit
to the number of Fourier coefficients we can obtain by measuring at the
points nΔ for integer values of n; the limitation here is of a more practical
nature.

In sonar, the speed of sound in the ocean is about 1500 meters per
second, so the wavelength associated with 50 Hz is λ = 30 meters. The
Nyquist spacing is then 15 meters. A towed array is a line array of sensors 
towed behind a ship. The length of the array, and therefore the number 
of Nyquist-spaced sensors for passive sensing at 50 Hz, is, in principle, 
unlimited. In practice, however, cost is always a factor. In addition, when 
the array becomes too long, it is difficult to maintain it in a straight-line 
position. 
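The arithmetic in the sonar example is easy to reproduce. The sketch below uses λ = c/f for a frequency f in Hz, which agrees with λ = 2πc/ω when ω = 2πf; the 1 km array length in the last line is a hypothetical figure used only to show how quickly the sensor count grows.

```python
def wavelength(speed, frequency_hz):
    """Wavelength lambda = c / f, equivalently 2*pi*c/omega with omega = 2*pi*f."""
    return speed / frequency_hz

def nyquist_spacing(speed, frequency_hz):
    """Nyquist spacing Delta = lambda / 2."""
    return wavelength(speed, frequency_hz) / 2.0

lam = wavelength(1500.0, 50.0)           # 30.0 meters
delta = nyquist_spacing(1500.0, 50.0)    # 15.0 meters
# Sensors on a hypothetical 1 km line at Nyquist spacing, counting both ends.
sensors_per_km = int(1000.0 // delta) + 1
```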

Radar imaging uses microwaves with a wavelength of about one inch, 
which is not a problem; synthetic-aperture radar can also be used to sim- 
ulate a longer array. In radio astronomy, however, the wavelengths can 
be more than a kilometer, which is why radio-astronomy arrays have to
be enormous. For radio-wave imaging at very low frequencies, a sort of
synthetic-aperture approach has been taken, with individual antennas lo- 
cated in different parts of the globe. 


3.7 Using Matched Filtering 


We saw previously that the signal that x receives from P(π/2) at time t
is the same for all x. If we could turn the x-axis counter-clockwise through
an angle of φ, then the signals received from P(π/2 + φ) at time t would be
the same for all x. Of course, we usually cannot turn the array physically
in this way; however, we can steer the array mathematically. This
mathematical steering makes use of matched filtering. In certain applications it
is reasonable to assume that only relatively few values of the function f(θ)
are significantly nonzero. Matched filtering is a commonly used method for 
dealing with such cases. 


3.7.1 A Single Source 


To take an extreme case, suppose that f(θ_0) > 0 and f(θ) = 0, for all
θ ≠ θ_0. The signal received at time t at x is then

$$s(x, t) = e^{i\omega(t - D/c)} f(\theta_0) e^{i\omega x\cos\theta_0/c}.$$

Our objective is to determine θ_0.

Suppose that we multiply s(x, t) by e^{-iωx cos θ/c}, for arbitrary values of
θ. When one of the arbitrary values is θ = θ_0, the product is no longer
dependent on the value of x; that is, the resulting product is the same for
all x. In practice, we can place sensors at some finite number of points x,
and then sum the resulting products over the x. When the arbitrary θ is
not θ_0, we are adding up complex exponentials with distinct phase angles,
so destructive interference takes place and the magnitude of the sum is
not large. In contrast, when θ = θ_0, all the products are the same and the
sum is relatively large. This is matched filtering, which is commonly used
to determine the true value of θ_0.
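The matched-filter computation for a single source can be sketched as follows. The array geometry (21 sensors at Nyquist spacing), the propagation speed, the frequency, the source strength and the one-degree search grid are all hypothetical choices; the common time factor e^{iω(t - D/c)} has magnitude one and is omitted.

```python
import cmath
import math

def matched_filter_output(signal, positions, theta, omega, c):
    """Magnitude of the sum over sensors of s(x) * e^{-i omega x cos(theta) / c}."""
    total = sum(s * cmath.exp(-1j * omega * x * math.cos(theta) / c)
                for s, x in zip(signal, positions))
    return abs(total)

c = 1500.0                        # hypothetical propagation speed
omega = 2.0 * math.pi * 50.0      # hypothetical angular frequency
delta = math.pi * c / omega       # Nyquist spacing lambda / 2
positions = [n * delta for n in range(-10, 11)]   # hypothetical 21-sensor line array

grid = [k * math.pi / 180.0 for k in range(1, 180)]   # candidate angles, 1 to 179 degrees
theta0 = grid[59]                 # true source direction (60 degrees), on the grid
strength = 2.0                    # hypothetical value of f(theta0)
signal = [strength * cmath.exp(1j * omega * x * math.cos(theta0) / c)
          for x in positions]

outputs = [matched_filter_output(signal, positions, th, omega, c) for th in grid]
best = grid[outputs.index(max(outputs))]   # angle with the largest output
```

At θ = θ_0 the 21 products are identical and add coherently; at every other grid angle the phases differ from sensor to sensor and the sum is smaller, so `best` coincides with `theta0`.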


3.7.2 Multiple Sources 


Having only one signal source is the extreme case; having two or more 
signal sources, perhaps not far apart in angle, is an important situation, as 
well. Then resolution becomes a problem. When we calculate the matched 
filter in the single-source case, the largest magnitude will occur when θ =
θ_0, but the magnitudes at other nearby values of θ will not be zero. How
quickly the values fall off as we move away from θ_0 will depend on the
aperture of the array; the larger the aperture, the faster the fall-off. When
we have two signal sources near to one another, say θ_1 and θ_2, the
matched-filter output can have its largest magnitude at a value of θ between the
two angles θ_1 and θ_2, causing a loss of resolution. Again, having a larger
aperture will improve the resolution. 


3.8 An Example: The Solar-Emission Problem 


In [15] Bracewell discusses the solar-emission problem. In 1942, it was 
observed that radio-wave emissions in the one-meter wavelength range were 
arriving from the sun. Were they coming from the entire disk of the sun 
or were the sources more localized, in sunspots, for example? The problem 
then was to view each location on the sun’s surface as a potential source of 
these radio waves and to determine the intensity of emission corresponding 
to each location. 

For electromagnetic waves the propagation speed is the speed of light 
in a vacuum, which we shall take here to be c = 3 × 10^8 meters per second.
The wavelength λ for gamma rays is around one Angstrom, that is, 10^{-10}
meters, which is about the diameter of an atom; for x-rays it is about one
millimicron, or 10^{-9} meters. The visible spectrum has wavelengths that
are a little less than one micron, that is, 10^{-6} meters, while infrared
radiation (IR), predominantly associated with heat, has a wavelength somewhat
longer. Infrared radiation with a wavelength around 6 or 7 microns can 
be used to detect water vapor; we use near IR, with a wavelength near 
that of visible light, to change the channels on our TV sets. Shortwave ra- 
dio has a wavelength around one millimeter. Microwaves have wavelengths 
between one centimeter and one meter; those used in radar imaging have 
a wavelength about one inch and can penetrate clouds and thin layers of 
leaves. Broadcast radio has a λ running from about 10 meters to 1000
meters. The so-called long radio waves can have wavelengths several thousand
meters long, necessitating clever methods of large-antenna design for radio 
astronomy. 

The sun has an angular diameter of 30 min. of arc, or one-half of a 
degree, when viewed from earth, but the needed resolution was more like 
3 min. of arc. Such resolution requires a larger aperture, a radio telescope 
1000 wavelengths across, which means a diameter of 1 km at a wavelength of
1 meter; in 1942 the largest military radar antennas were less than 5 meters
across. A solution was found, using the method of reconstructing an object
from line-integral data, a technique that surfaced again in tomography. 
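The figure of roughly 1000 wavelengths can be checked with a back-of-the-envelope calculation. The sketch below assumes the common rule of thumb that angular resolution in radians is about one wavelength divided by the aperture; that rule is an assumption brought in here, not a formula stated in the text, and it is only meant to confirm the order of magnitude.

```python
import math

def aperture_in_wavelengths(resolution_arcmin):
    """Rule of thumb: resolution (radians) ~ wavelength / aperture,
    so aperture / wavelength ~ 1 / resolution."""
    resolution_rad = math.radians(resolution_arcmin / 60.0)
    return 1.0 / resolution_rad

needed = aperture_in_wavelengths(3.0)   # on the order of 1e3 wavelengths
diameter_m = needed * 1.0               # aperture in meters at a 1 meter wavelength
```

For 3 minutes of arc this gives a little over 1100 wavelengths, consistent with the telescope of about 1000 wavelengths, or about 1 km at a wavelength of 1 meter, mentioned above.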


3.9 Estimating the Size of Distant Objects 


Suppose, in the previous example of the unknown strength problem, 
we assume that f(x) = B, for all x in the interval [-L, L], where B > 0
is the unknown brightness constant, and we don’t know L. More realistic, 
two-dimensional versions of this problem arise in astronomy, when we want 
to estimate the diameter of a distant star. 

In this case, the measurement of the signal at the point P gives us

$$\int_{-L}^{L} f(x) \cos\left(\frac{\omega\cos\theta}{c} x\right) dx = B \int_{-L}^{L} \cos\left(\frac{\omega\cos\theta}{c} x\right) dx = \frac{2Bc}{\omega\cos\theta} \sin\left(\frac{L\omega\cos\theta}{c}\right),$$

when cos θ ≠ 0, whose absolute value is then the strength of the signal at P.
Notice that we have zero signal strength at P when the angle θ associated
with P satisfies the equation

$$\sin\left(\frac{L\omega\cos\theta}{c}\right) = 0,$$

without cos θ = 0. But we know that the first positive zero of the sine
function is at π, so the signal strength at P is zero when θ is such that

$$\frac{L\omega\cos\theta}{c} = \pi.$$

If

$$\frac{L\omega}{c} \ge \pi,$$

then we can solve for L and get

$$L = \frac{\pi c}{\omega\cos\theta}.$$

When Lω is too small, there will be no angle θ for which the received signal
strength at P is zero. If the signals being sent are actually broadband,
meaning that the signals are made up of components at many different
frequencies, not just one ω, which is usually the case, then we might be
able to filter our measured data, keep only the component at a sufficiently
high frequency, and then proceed as before.
But even when we have only a single frequency ω and Lω is too small,
there is something we can do. The received strength at θ = π/2 is

$$F_c(0) = B \int_{-L}^{L} dx = 2BL.$$

If we knew B, this measurement alone would give us L, but we do not
assume that we know B. At any other angle, the received strength is

$$F_c(\gamma) = \frac{2Bc}{\omega\cos\theta} \sin\left(\frac{L\omega\cos\theta}{c}\right).$$

Therefore,

$$\frac{F_c(\gamma)}{F_c(0)} = \frac{\sin(H(\theta))}{H(\theta)},$$

where

$$H(\theta) = \frac{L\omega\cos\theta}{c}.$$

From the measured value F_c(γ)/F_c(0) we can solve for H(θ) and then for
L. In actual optical astronomy, atmospheric distortions make these
measurements noisy and the estimates have to be performed more carefully.
This issue is discussed in more detail in Chapter 2, in Section 2.13 on 
Two-Dimensional Fourier Transforms. 
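Solving for H(θ) from the measured ratio has no closed form, but a simple bisection works when 0 < H(θ) < π, since sin(H)/H decreases from 1 to 0 on that interval. The sketch below is a noise-free illustration under that assumption; the function names and the numerical values of c, ω, θ and L are hypothetical.

```python
import math

def sinc(h):
    """sin(h)/h, with the value one at h = 0."""
    return 1.0 if h == 0.0 else math.sin(h) / h

def solve_for_H(ratio, tol=1e-12):
    """Bisection for H in (0, pi) with sin(H)/H = ratio."""
    lo, hi = 0.0, math.pi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sinc(mid) > ratio:
            lo = mid        # sinc is still too large, so H lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_half_width(ratio, omega, c, theta):
    """Recover L from the ratio F_c(gamma)/F_c(0), using H = L*omega*cos(theta)/c."""
    return solve_for_H(ratio) * c / (omega * math.cos(theta))

# Hypothetical example with L*omega/c = 2.4 < pi, so no null is available.
c, omega, theta, L_true = 1.0, 2.0, math.pi / 3, 1.2
ratio = sinc(L_true * omega * math.cos(theta) / c)
L_est = estimate_half_width(ratio, omega, c, theta)
```

With exact data the estimate reproduces L_true; with noisy ratios the same inversion would need the more careful treatment mentioned in the text.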

There is a simple relationship involving the intrinsic luminosity of a 
star, its distance from earth, and its apparent brightness; knowing any two 
of these, we can calculate the third. Once we know these values, we can 
figure out how large the visible universe is. Unfortunately, only the appar- 
ent brightness is easily determined. As Alan Lightman relates in [111], it 
was Henrietta Leavitt’s ground-breaking discovery, in 1912, of the “period- 
luminosity” law of variable Cepheid stars that eventually revealed just how 
enormous the universe really is. Cepheid stars are found in many parts of 
the sky. Their apparent brightness varies periodically. As Leavitt, working 
at the Harvard College Observatory, discovered, the greater the intrinsic 
luminosity of the star, the longer the period of variable brightness. The 
final step of calibration was achieved in 1913 by the Danish astronomer 
Ejnar Hertzsprung, when he was able to establish the actual distance to a 
relatively nearby Cepheid star, essentially by parallax methods. 

There is a wonderful article by Eddington [69], in which he discusses 
the use of signal processing methods to discover the properties of the star 
Algol. This star, formally Algol (Beta Persei) in the constellation Perseus,
turns out to be three stars, two revolving around the third, with both of the
first two taking turns eclipsing the other. The stars rotate around their own 
axes, as our star, the sun, does, and the speed of rotation can be estimated 
by calculating the Doppler shift in frequency, as one side of the star comes 
toward us and the other side moves away. It is possible to measure one side 
at a time only because of the eclipse caused by the other revolving star. 


3.10 The Transmission Problem 


Now we change the situation and suppose that we are designing a
broadcasting system, using transmitters at each x in the interval [-L, L].


3.10.1 Directionality 


At each x we will transmit f(x)e^{iωt}, where both f(x) and ω are chosen
by us. We now want to calculate what will be received at each point P in 
the far field. We may wish to design the system so that the strengths of the 
signals received at the various P are not all the same. For example, if we 
are broadcasting from Los Angeles, we may well want a strong signal in the 
north and south directions, but weak signals east and west, where there are 
fewer people to receive the signal. Clearly, our model of a single-frequency 
signal is too simple, but it does allow us to illustrate several important 
points about directionality in array processing. 


3.10.2 The Case of Uniform Strength 


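For concreteness, we investigate the case in which f(x) = 1 for $|x| \leq L$. 
In this case, the measurement of the signal at the point P gives us 

$$F(P) = \int_{-L}^{L} f(x)\cos\left(\frac{\omega x\cos\theta}{c}\right)dx
= \int_{-L}^{L} \cos\left(\frac{\omega x\cos\theta}{c}\right)dx
= \frac{2c}{\omega\cos\theta}\sin\left(\frac{L\omega\cos\theta}{c}\right),$$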

when $\cos\theta \neq 0$. The absolute value of F(P) is then the strength of the 
signal at P. In Figures 3.2 through 3.7 we see the plots of the function 
$\frac{1}{2L}F(P)$, for various values of the aperture $A$. 

FIGURE 3.2: Relative strength at P for A = 0.5. 

FIGURE 3.3: Relative strength at P for A = 1.0. 

FIGURE 3.4: Relative strength at P for A = 1.5. 

FIGURE 3.5: Relative strength at P for A = 1.8. 

FIGURE 3.6: Relative strength at P for A = 3.2. 

FIGURE 3.7: Relative strength at P for A = 6.5. 


3.10.2.1 Beam-Pattern Nulls 


Is it possible for the strength of the signal received at some P to be 
zero? As we saw in the previous section, to have zero signal strength, that 
is, to have F(P) = 0, we need 

$$\sin\left(\frac{L\omega\cos\theta}{c}\right) = 0,$$

without $\cos\theta = 0$. Therefore, we need 

$$\frac{L\omega\cos\theta}{c} = n\pi,$$

for some positive integers $n \geq 1$. Notice that this can happen only if 

$$n \leq \frac{L\omega}{\pi c} = \frac{2L}{\lambda}.$$

Therefore, if $2L < \lambda$, there can be no P with signal strength zero. The 
larger 2L is, with respect to the wavelength $\lambda$, the more angles at which 
the signal strength is zero. 
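
The null condition can be checked numerically. The following Python sketch assumes the uniform-strength beam pattern derived above, with $\omega/c = 2\pi/\lambda$; the function names `null_angles` and `beam` and the sample values of L and $\lambda$ are illustrative choices, not part of the text.

```python
import math

def null_angles(L, wavelength):
    """Angles theta in [0, pi/2] (radians) at which F(P) = 0.

    Uses L*omega*cos(theta)/c = n*pi with omega/c = 2*pi/wavelength,
    that is, cos(theta) = n*wavelength/(2L) for n = 1, 2, ...
    """
    angles = []
    n = 1
    while n * wavelength <= 2 * L:
        angles.append(math.acos(n * wavelength / (2 * L)))
        n += 1
    return angles

def beam(theta, L, wavelength):
    """F(P) = 2L sinc(H(theta)) for f(x) = 1 on [-L, L]."""
    h = L * (2 * math.pi / wavelength) * math.cos(theta)   # H(theta)
    return 2 * L if h == 0 else 2 * L * math.sin(h) / h

# With 2L smaller than the wavelength there are no nulls at all.
no_nulls = null_angles(0.25, 1.0)
# With 2L/wavelength = 3.2 the integers n = 1, 2, 3 each give one angle.
three_nulls = null_angles(1.6, 1.0)
```

By symmetry, each angle $\theta$ returned here has a mirror-image null at $\pi - \theta$, corresponding to negative n.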


3.10.2.2 Local Maxima 


Is it possible for the strength of the signal received at some P to be a 
local maximum, relative to nearby points in the far field? We write 


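$$F(P) = \frac{2c}{\omega\cos\theta}\sin\left(\frac{L\omega\cos\theta}{c}\right) = 2L\,\mathrm{sinc}(H(\theta)),$$

where 

$$H(\theta) = \frac{L\omega\cos\theta}{c}$$

and 

$$\mathrm{sinc}(H(\theta)) = \frac{\sin H(\theta)}{H(\theta)}$$

for $H(\theta) \neq 0$, and equals one for $H(\theta) = 0$. The value of A used previously 
is then A = H(0). 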

Local maxima or minima of F(P) occur when the derivative of 
$\mathrm{sinc}(H(\theta))$ equals zero, which means that 

$$H(\theta)\cos H(\theta) - \sin H(\theta) = 0,$$

or 

$$\tan H(\theta) = H(\theta).$$

If we can solve this equation for $H(\theta)$ and then for $\theta$, we will have found 
angles corresponding to local maxima of the received signal strength. The 
largest value of F(P) occurs when $\theta = \frac{\pi}{2}$, and the peak in the plot of F(P) 
centered at $\theta = \frac{\pi}{2}$ is called the main lobe. The smaller peaks on either side 
are called the grating lobes. We can see grating lobes in some of the polar 
plots. 
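
The equation $\tan H = H$ has no closed-form solution, but its roots are easy to find numerically. The sketch below uses plain bisection on $H\cos H - \sin H$, which changes sign on each interval $(n\pi, n\pi + \pi/2)$; the function names and the sample values of L and $\lambda$ are assumptions made for illustration.

```python
import math

def sidelobe_H(n, iterations=60):
    """n-th positive root (n >= 1) of H*cos(H) - sin(H) = 0, found by bisection."""
    g = lambda h: h * math.cos(h) - math.sin(h)
    lo, hi = n * math.pi, n * math.pi + math.pi / 2   # g changes sign here
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sidelobe_angles(L, wavelength):
    """Angles theta in [0, pi/2) where H(theta) = 2*pi*L*cos(theta)/wavelength is a root."""
    h_max = 2 * math.pi * L / wavelength    # value of H at theta = 0
    angles, n = [], 1
    while sidelobe_H(n) <= h_max:
        angles.append(math.acos(sidelobe_H(n) / h_max))
        n += 1
    return angles

first_root = sidelobe_H(1)                 # approximately 4.4934
lobes = sidelobe_angles(1.6, 1.0)          # 2L/wavelength = 3.2
```

These roots are stationary points of F(P), so they mark local maxima of the strength |F(P)|; the root H = 0 itself corresponds to the main lobe at $\theta = \pi/2$.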


3.11 The Laplace Transform and the Ozone Layer 


We have seen how values of the Fourier transform can arise as measured 
data. The following examples, the first taken from Twomey’s book [156], 
show that values of the Laplace transform can arise in this way as well. 


3.11.1 The Laplace Transform 


The Laplace transform of the function f(x), defined for $0 \leq x < +\infty$, 
is the function 

$$F(s) = \int_0^{+\infty} f(x)e^{-sx}\,dx.$$


3.11.2 Scattering of Ultraviolet Radiation 


The sun emits ultraviolet (UV) radiation that enters the earth's atmo- 
sphere at an angle $\theta_0$ that depends on the sun's position, and with intensity 
I(0). Let the x-axis be vertical, with x = 0 at the top of the atmosphere 
and x increasing as we move down to the earth's surface, at x = X. The 
intensity at x is given by 

$$I(x) = I(0)e^{-kx/\cos\theta_0}.$$

Within the ozone layer, the amount of UV radiation scattered in the direc- 
tion $\theta$ is given by 

$$S(\theta,\theta_0)I(0)e^{-kx/\cos\theta_0}\Delta p,$$

where $S(\theta,\theta_0)$ is a known parameter, and $\Delta p$ is the change in the pressure 
of the ozone within the infinitesimal layer $[x, x+\Delta x]$, and so is proportional 
to the concentration of ozone within that layer. 


3.11.3 Measuring the Scattered Intensity 


The radiation scattered at the angle $\theta$ then travels to the ground, a 
distance of X - x, weakened along the way, and reaches the ground with 
intensity 

$$S(\theta,\theta_0)I(0)e^{-kx/\cos\theta_0}e^{-k(X-x)/\cos\theta}\Delta p.$$

The total scattered intensity at angle $\theta$ is then a superposition of the in- 
tensities due to scattering at each of the thin layers, and is then 

$$S(\theta,\theta_0)I(0)e^{-kX/\cos\theta}\int_0^X e^{-x\beta}\,dp,$$

where 

$$\beta = k\left[\frac{1}{\cos\theta_0} - \frac{1}{\cos\theta}\right].$$

This superposition of intensity can then be written as 

$$S(\theta,\theta_0)I(0)e^{-kX/\cos\theta}\int_0^X e^{-x\beta}p'(x)\,dx.$$

3.11.4 The Laplace Transform Data 


Using integration by parts, we get 

$$\int_0^X e^{-\beta x}p'(x)\,dx = p(X)e^{-\beta X} - p(0) + \beta\int_0^X e^{-\beta x}p(x)\,dx.$$

Since p(0) = 0 and p(X) can be measured, our data is then the Laplace 
transform value 

$$\int_0^{+\infty} e^{-\beta x}p(x)\,dx;$$

note that we can replace the upper limit X with $+\infty$ if we extend p(x) as 
zero beyond x = X. 

The variable $\beta$ depends on the two angles $\theta$ and $\theta_0$. We can alter $\theta$ as 
we measure and $\theta_0$ changes as the sun moves relative to the earth. In this 
way we get values of the Laplace transform of p(x) for various values of $\beta$. 
The problem then is to recover p(x) from these values. Because the Laplace 
transform involves a smoothing of the function p(x), recovering p(x) from 
its Laplace transform is more ill-conditioned than is the Fourier transform 
inversion problem. 
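
The remark about ill-conditioning can be illustrated with a small numerical experiment. The sketch below is only an illustration: it assumes X = 1, a midpoint-rule discretization with 12 points, and the hypothetical values $\beta = 1, 2, ..., 12$, and it compares the condition number of the resulting Laplace-transform matrix with that of a Fourier matrix built on the same grid.

```python
import numpy as np

n = 12
x = (np.arange(n) + 0.5) / n                  # midpoints of [0, 1] (X = 1 assumed)
beta = np.arange(1, n + 1, dtype=float)       # hypothetical measured beta values

# Row i approximates the integral of exp(-beta_i * x) p(x) dx by the midpoint rule.
laplace = np.exp(-np.outer(beta, x)) / n
# Row k approximates the integral of exp(2*pi*i*k*x) p(x) dx on the same grid.
fourier = np.exp(2j * np.pi * np.outer(np.arange(n), x)) / n

cond_laplace = np.linalg.cond(laplace)        # very large: inversion amplifies errors
cond_fourier = np.linalg.cond(fourier)        # essentially 1: rows are orthogonal
```

A large condition number means that small errors in the measured transform values can produce large errors in the recovered p(x).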


3.12 The Laplace Transform and Energy Spectral 
Estimation 


In x-ray transmission tomography, x-ray beams are sent through the 
object and the drop in intensity is measured. These measurements are 
then used to estimate the distribution of attenuating material within the 
object. A typical x-ray beam contains components with different energy 
levels. Because components at different energy levels will be attenuated 
differently, it is important to know the relative contribution of each energy 
level to the entering beam. The energy spectrum is the function f(E) that 
describes the intensity of the components at each energy level E > 0. 


3.12.1 The Attenuation Coefficient Function 


Each specific material, say aluminum, for example, is associated with 
an attenuation coefficient, which is a function of energy, which we shall denote 
by $\mu(E)$. A beam with the single energy E passing through a thickness x of 
the material will be weakened by the factor $e^{-\mu(E)x}$. By passing the beam 
through various thicknesses x of aluminum and registering the intensity 
drops, one obtains values of the absorption function 

$$R(x) = \int_0^{\infty} f(E)e^{-\mu(E)x}\,dE. \tag{3.5}$$


Using a change of variable, we can write R(x) as a Laplace transform. 


3.12.2 The Absorption Function as a Laplace Transform 


For each material, the attenuation function $\mu(E)$ is a strictly decreasing 
function of E, so $\mu(E)$ has an inverse, which we denote by g; that is, 
g(t) = E, for $t = \mu(E)$. Equation (3.5) can then be rewritten as 

$$R(x) = \int_0^{\infty} e^{-tx}f(g(t))g'(t)\,dt.$$

We see then that R(x) is the Laplace transform of the function r(t) = 
f(g(t))g'(t). Our measurements of the intensity drops provide values of 
R(x), for various values of x, from which we must estimate the functions 
r(t), and, ultimately, f(E). 
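
The change of variable can be written out step by step. What follows is a sketch only; it assumes that $\mu$ is differentiable and decreases from $+\infty$ (as $E \to 0$) to $0$ (as $E \to \infty$), which is not stated explicitly above.

```latex
t = \mu(E), \qquad E = g(t), \qquad dE = g'(t)\,dt,
\\
R(x) = \int_{E=0}^{E=\infty} f(E)\,e^{-\mu(E)x}\,dE
     = \int_{t=\mu(0)}^{t=\mu(\infty)} f(g(t))\,e^{-tx}\,g'(t)\,dt .
```

Because g is decreasing, $g'(t) < 0$ and the limits in t run from the larger value to the smaller one. Reversing the limits replaces $g'(t)$ with $|g'(t)|$, so that, under the stated assumption, the integral runs over $0 \leq t < \infty$ with a nonnegative integrand.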
4.1 Chapter Summary 


All of the techniques discussed in this book deal, in one way or another, 
with one fundamental problem: Estimate the values of a function f(x) 
from finitely many (usually noisy) measurements related to f(x); here x 
can be a multi-dimensional vector, so that f can be a function of more than 
one variable. To keep the notation relatively simple here, we shall assume, 
throughout this chapter, that x is a real variable, but all of what we shall 
say applies to multi-variate functions as well. In this chapter we begin our 
discussion of the use of finite-parameter models, a topic to which we shall 
return several times throughout this book. 


4.2 Finite Fourier Series 


In this section we present one of the most useful finite-parameter models, 
the finite Fourier series. The notation may seem unusual, but it is chosen 
for convenience later, when we discuss the Fast Fourier Transform (FFT). 

Let $f : [0, N] \to \mathbb{C}$ have Fourier series 

$$f(x) \approx \frac{1}{N}\sum_{k=-\infty}^{\infty} F_k e^{-i\frac{2\pi k}{N}x},$$

where 

$$F_k = \int_0^N f(x)e^{i\frac{2\pi k}{N}x}\,dx. \tag{4.1}$$

Note that 

$$F_k = F\left(\frac{2\pi k}{N}\right),$$

where $F(\gamma)$ is the Fourier transform of f(x). In order to calculate any $F_k$, 
we need all of f(x). 

Suppose that we model f(x) on [0, N] using a finite Fourier series 

$$f(x) = \frac{1}{N}\sum_{k=0}^{N-1} F_k e^{-i\frac{2\pi k}{N}x}.$$

We can still calculate the $F_k$ using Equation (4.1), but now there are other 
ways. 

Suppose we obtain N values of f(x), say $f(x_n)$, for n = 0,1,..., N - 1. 
Such situations arise, for example, in time-series analysis, where x repre- 
sents time and we are able to measure the function f(x) at some finitely 
many different times. The function f(x) could represent acoustic pressure 
coming from speech, current values of a particular stock on the Stock Ex- 
change, the temperature at time x in a particular place, and so on. We may 
want to model f(x) to estimate values of f(x) we were unable to measure, 
perhaps for prediction, or to break f(x) up into finitely many sinusoidal 
components. This latter problem is important in digital sound recording 
and speech recognition. 

Once we have the data $f(x_n)$, for n = 0,1,...,N - 1, we can then get 
the $F_k$ by solving a system of N linear equations in N unknowns: 

$$f(x_n) = \frac{1}{N}\sum_{k=0}^{N-1} F_k e^{-i\frac{2\pi k}{N}x_n}. \tag{4.2}$$

Solving this system typically requires roughly $N^3$ complex multiplications, 
which, for many applications in which N is in the thousands, is prohibitively 
expensive and time-consuming. However, if we have the freedom to select 
the $x_n$ and choose $x_n = n$, then solving the system becomes much simpler, 
because of discrete orthogonality. 

With $x_n = n$, the solution of the system of linear equations 

$$f(n) = \frac{1}{N}\sum_{k=0}^{N-1} F_k e^{-i\frac{2\pi}{N}kn} \tag{4.3}$$

is 

$$F_k = \sum_{n=0}^{N-1} f(n)e^{i\frac{2\pi}{N}kn}, \tag{4.4}$$

for k = 0,1,...,N - 1. The proof of this assertion is contained in the
following exercises. 
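
Before working through the proof, the assertion can be checked numerically. The Python sketch below is a check rather than a proof; it evaluates Equations (4.4) and (4.3) by direct summation, and the function names are hypothetical. It also notes how the sign and scaling used here relate to NumPy's conventions, in which `numpy.fft.ifft` carries the positive exponent and the factor 1/N.

```python
import numpy as np

def coefficients_from_samples(f):
    """F_k = sum_n f(n) exp(2*pi*i*k*n/N), Equation (4.4), by direct summation."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(2j * np.pi * k * n / N) @ np.asarray(f, dtype=complex)

def samples_from_coefficients(F):
    """f(n) = (1/N) sum_k F_k exp(-2*pi*i*k*n/N), Equation (4.3)."""
    N = len(F)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return np.exp(-2j * np.pi * n * k / N) @ np.asarray(F, dtype=complex) / N

rng = np.random.default_rng(0)
f = rng.standard_normal(16) + 1j * rng.standard_normal(16)   # arbitrary test samples
F = coefficients_from_samples(f)
recovered = samples_from_coefficients(F)                     # should reproduce f
```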


Ex. 4.1 Use the formula for the sum of a finite geometric progression to 
show that 

$$\sum_{n=0}^{N-1} e^{int} = e^{i\frac{(N-1)t}{2}}\,\frac{\sin\frac{Nt}{2}}{\sin\frac{t}{2}}. \tag{4.5}$$


Ex. 4.2 Prove the assertion in Equation (4.4) by multiplying both sides of 
Equation (4.3) by $e^{i\frac{2\pi}{N}jn}$, and summing over n. Interchange the order of 
summation and use Equation (4.5). 


The formula in Equation (4.5) is perhaps the most important in sig- 
nal processing and we shall encounter it several times later in this book. 
It describes coherent summation, the phenomenon of constructive and de- 
structive interference, and is the basic formula in sonar and radar. It also 
arises in matched filtering, optimal detection theory, and the DFT estima- 
tion of the Fourier transform. 
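
Equation (4.5) is easy to test numerically. The sketch below compares the sum with the closed form for a value of t that is not a multiple of $2\pi$ (where the closed form is a 0/0 expression), and then shows the cancellation that underlies discrete orthogonality; the function names and test values are illustrative.

```python
import numpy as np

def geometric_sum(N, t):
    """Left side of Equation (4.5): sum of exp(i*n*t) for n = 0, ..., N-1."""
    return np.sum(np.exp(1j * np.arange(N) * t))

def closed_form(N, t):
    """Right side of Equation (4.5); requires sin(t/2) != 0."""
    return np.exp(1j * (N - 1) * t / 2) * np.sin(N * t / 2) / np.sin(t / 2)

N = 8
coherent = geometric_sum(N, 0.0)                   # all terms aligned: equals N
cancelled = geometric_sum(N, 2 * np.pi * 3 / N)    # t = 2*pi*j/N with j = 3: sums to zero
```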
4.3 The DFT and the Finite Fourier Series 


In the unknown strength problem we saw that measurements in the 
far field could give us finitely many values of the Fourier coefficients of a 
support-limited function. Suppose now that $f : [0, N] \to \mathbb{C}$ is such an
unknown function, and we have obtained the Fourier-transform values $F(\frac{2\pi k}{N})$ 
for k = 0,1,..., N - 1. It is reasonable to use the DFT to estimate f(x): 

$$f(x) \approx f_{DFT}(x) = \frac{1}{N}\sum_{k=0}^{N-1} F\left(\frac{2\pi k}{N}\right)e^{-i\frac{2\pi k}{N}x}.$$

The DFT looks just like the finite Fourier series we discussed previously. 
We can calculate the N values of $f_{DFT}(x)$ at the points x = n using the 
formula in Equation (4.3): 

$$f_{DFT}(n) = \frac{1}{N}\sum_{k=0}^{N-1} F\left(\frac{2\pi k}{N}\right)e^{-i\frac{2\pi kn}{N}}. \tag{4.6}$$


Note, however, that the context has changed. Previously, we assumed that 
we had actual values of f(x) at the points x = n and we used the finite 
Fourier series to model f(x). Now we are assuming that it is finitely many 
values of the actual Fourier transform, $F(\gamma)$, that we have obtained, and 
we want to use those values to estimate f(x). What we are getting when 
we use Equation (4.6) are not actual values of f(x) itself, but of the DFT 
estimator of f(x). 

As we noted previously, solving for the $F_k$ using the system described by 
Equation (4.2) would require roughly $N^3$ complex multiplications. When 
we select $x_n = n$ we can solve the system in Equation (4.3) in $N^2$ complex 
multiplications. But for very large N, even $N^2$ is too large. Fortunately, 
there is the Fast Fourier Transform (FFT), which we shall consider in detail 
in Chapter 8. The FFT reduces the computational cost to roughly $N \log_2 N$ 
complex multiplications. 
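
The saving can be made concrete with a library FFT. The sketch below assumes NumPy, whose `numpy.fft.fft` uses the exponent $-2\pi i kn/N$ with no scaling, so that Equation (4.6) corresponds to `fft(F) / N`; the function names and the random test values are illustrative only.

```python
import numpy as np

def dft_estimate_direct(F):
    """f_DFT(n) = (1/N) sum_k F_k exp(-2*pi*i*k*n/N), using about N^2 multiplications."""
    N = len(F)
    idx = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / N) @ F / N

def dft_estimate_fft(F):
    """The same N values computed with the FFT."""
    return np.fft.fft(F) / len(F)

rng = np.random.default_rng(0)
F = rng.standard_normal(64) + 1j * rng.standard_normal(64)   # stand-in transform values

N = 1024
direct_cost = N ** 2               # 1,048,576 multiplications
fft_cost = int(N * np.log2(N))     # 10,240 multiplications
```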


4.4 The Vector DFT 


The discussion in the previous sections motivates the definition of 
the vector DFT (vDFT). Given any column vector f in $\mathbb{C}^N$ with entries 
$f_0, f_1, ..., f_{N-1}$, we define the vector DFT (vDFT) of f to be the complex 
vector F in $\mathbb{C}^N$ having the entries 

$$F_k = \sum_{n=0}^{N-1} f_n e^{i\frac{2\pi kn}{N}}, \tag{4.7}$$

for k = 0,1,..., N - 1. From our previous discussion, we know that we then 
have 

$$f_n = \frac{1}{N}\sum_{k=0}^{N-1} F_k e^{-i\frac{2\pi nk}{N}},$$

for n = 0,1,..., N - 1. 

Most texts on signal processing call the vector F the DFT of the vector 
f. I think this is bad terminology, as I shall explain. Suppose we have data 
f(n), for n = 0,1,...,N - 1, and the Fourier transform function $F(\gamma)$ is 
unknown, but known to be supported on the interval $[0, 2\pi]$. We want to 
estimate $F(\gamma)$ using the data. One way is to use the DFT estimate, 

$$F_{DFT}(\gamma) = \sum_{n=0}^{N-1} f(n)e^{in\gamma}.$$

The next step would be to plot our estimate. To do this we select some 
finitely many values of $\gamma$, say $\gamma_k$, for k = 0,1,..., K - 1, and evaluate 
$F_{DFT}(\gamma_k)$. If we choose K = N and $\gamma_k = \frac{2\pi k}{N}$ we get Equation (4.7), 
with $F_k = F_{DFT}(\frac{2\pi k}{N})$. If we use the FFT, we can calculate all the $F_k$ quickly. 
However, the FFT prefers to have N equal to some power of two. If, for 
example, we have N = 250, we can trick the FFT by defining f(250) = 
f(251) = ... = f(255) = 0, and changing N = 250 to N = 256. The DFT 
estimate is still the same function of the continuous variable $\gamma$, but now 
the FFT will evaluate the DFT at 256 equi-spaced points within the interval 
$[0, 2\pi)$. In fact, if we should want to generate a plot of the DFT that had, 
say, 1024 grid points, we could simply augment our original data set with 
sufficiently many zero values, and then perform the FFT; this is called 
zero-padding. In each case, we calculate a vector F, but the sizes change as 
we augment the data with more zero values. To call each of these vectors F 
the DFT seems to me to be wrong. Each one is a vDFT of a certain set of 
data, original or augmented, while the DFT remains the same function of 
the continuous variable $\gamma$. It is important to remember that the values $F_k$ 
we calculate are not values of the actual $F(\gamma)$, but of the DFT estimator 
of $F(\gamma)$. This point is sometimes missed in the literature on the subject. 
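
Zero-padding can be demonstrated in a few lines. The sketch below assumes NumPy and uses random stand-in data of length 250; the function names are hypothetical. It evaluates the same function $F_{DFT}(\gamma)$ on grids of 256 and 1024 points, once by direct summation and once through a padded FFT. Because of the positive exponent in Equation (4.7), the matching NumPy call is `K * numpy.fft.ifft(...)`.

```python
import numpy as np

def dft_estimate(f, gamma):
    """F_DFT(gamma) = sum_n f(n) exp(i*n*gamma), evaluated at each given gamma."""
    n = np.arange(len(f))
    return np.exp(1j * np.outer(np.atleast_1d(gamma), n)) @ f

def dft_on_grid(f, K):
    """Values F_DFT(2*pi*k/K), k = 0, ..., K-1, via zero-padding to length K >= len(f)."""
    padded = np.zeros(K, dtype=complex)
    padded[:len(f)] = f
    return K * np.fft.ifft(padded)

rng = np.random.default_rng(1)
f = rng.standard_normal(250)          # stand-in data f(0), ..., f(249)
coarse = dft_on_grid(f, 256)          # a vDFT of the data padded with 6 zeros
fine = dft_on_grid(f, 1024)           # a vDFT of the data padded with 774 zeros
```

The two vectors have different lengths, yet both sample the one function $F_{DFT}(\gamma)$; every fourth entry of `fine` coincides with an entry of `coarse`.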
4.5 The Vector DFT in Two Dimensions 


We consider now a complex-valued function f(x,y) of two real variables, 
with Fourier transformation 

$$F(\alpha,\beta) = \int\int f(x,y)e^{i(x\alpha+y\beta)}\,dx\,dy.$$

Suppose that $F(\alpha,\beta) = 0$, except for $\alpha$ and $\beta$ in the interval $[0,2\pi]$; this 
means that the function $F(\alpha,\beta)$ represents a two-dimensional object with 
bounded support, such as a picture. Then $F(\alpha,\beta)$ has a Fourier series 
expansion 

$$F(\alpha,\beta) = \sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} f(m,n)e^{im\alpha}e^{in\beta} \tag{4.8}$$

for $0 \leq \alpha \leq 2\pi$ and $0 \leq \beta \leq 2\pi$. 

In image processing, $F(\alpha,\beta)$ is our two-dimensional analogue image, 
where $\alpha$ and $\beta$ are continuous variables. The first step in digital image 
processing is to digitize the image, which means forming a two-dimensional 
array of numbers $F_{j,k}$, for j,k = 0,1,..., N - 1. For concreteness, we let the 
$F_{j,k}$ be the values $F(\frac{2\pi}{N}j, \frac{2\pi}{N}k)$. 

From Equation (4.8) we can write 

$$F_{j,k} = F\left(\frac{2\pi}{N}j,\frac{2\pi}{N}k\right) = \sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} f(m,n)e^{i\frac{2\pi}{N}jm}e^{i\frac{2\pi}{N}kn},$$

for j,k = 0,1,...,N - 1. 

We can also find coefficients $f_{m,n}$, for m,n = 0,1,..., N - 1, such that 

$$F_{j,k} = F\left(\frac{2\pi}{N}j,\frac{2\pi}{N}k\right) = \sum_{m=0}^{N-1}\sum_{n=0}^{N-1} f_{m,n}e^{i\frac{2\pi}{N}jm}e^{i\frac{2\pi}{N}kn},$$

for j,k = 0,1,...,N - 1. These $f_{m,n}$ are only approximations of the values 
f(m,n), as we shall see. 

Just as in the one-dimensional case, we can make use of orthogonality 
to find the coefficients $f_{m,n}$. We have 

$$f_{m,n} = \frac{1}{N^2}\sum_{j=0}^{N-1}\sum_{k=0}^{N-1} F\left(\frac{2\pi}{N}j,\frac{2\pi}{N}k\right)e^{-i\frac{2\pi}{N}jm}e^{-i\frac{2\pi}{N}kn}, \tag{4.9}$$

for m,n = 0,1,..., N - 1. Now we show how the $f_{m,n}$ can be thought of as 
approximations of the f(m,n). 

We know from the Fourier Inversion Formula in two dimensions,
Equation (2.18), that 

$$f(m,n) = \frac{1}{4\pi^2}\int_0^{2\pi}\int_0^{2\pi} F(\alpha,\beta)e^{-i(\alpha m+\beta n)}\,d\alpha\,d\beta. \tag{4.10}$$

When we replace the right side of Equation (4.10) with a Riemann sum, 
we get 

$$f(m,n) \approx \frac{1}{N^2}\sum_{j=0}^{N-1}\sum_{k=0}^{N-1} F\left(\frac{2\pi}{N}j,\frac{2\pi}{N}k\right)e^{-i\frac{2\pi}{N}jm}e^{-i\frac{2\pi}{N}kn};$$

the right side is precisely $f_{m,n}$, according to Equation (4.9). 

Notice that we can compute the $f_{m,n}$ from the $F_{j,k}$ using one- 
dimensional vDFTs. For each fixed j we compute the one-dimensional 
vDFT 

$$G_{j,n} = \frac{1}{N}\sum_{k=0}^{N-1} F_{j,k}e^{-i\frac{2\pi}{N}kn},$$

for n = 0,1,..., N - 1. Then for each fixed n we compute the one-dimensional 
vDFT 

$$f_{m,n} = \frac{1}{N}\sum_{j=0}^{N-1} G_{j,n}e^{-i\frac{2\pi}{N}jm},$$

for m = 0,1,...,N - 1. From this, we see that estimating f(x,y) by
calculating the two-dimensional vDFT of the values from $F(\alpha,\beta)$ requires us to 
obtain 2N one-dimensional vector DFTs. 
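
The row-then-column procedure translates directly into code. The following sketch assumes NumPy, whose `numpy.fft.fft` uses the negative exponent needed in Equation (4.9); the function name `coefficients_2d` and the random stand-in for the pixel array are illustrative.

```python
import numpy as np

def coefficients_2d(Fjk):
    """The f_{m,n} of Equation (4.9), computed with 2N one-dimensional transforms."""
    N = Fjk.shape[0]
    # For each fixed j (one row), transform over k to get G[j, n].
    G = np.fft.fft(Fjk, axis=1) / N
    # For each fixed n (one column), transform over j to get f[m, n].
    return np.fft.fft(G, axis=0) / N

N = 8
rng = np.random.default_rng(2)
Fjk = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))  # stand-in pixel values
fmn = coefficients_2d(Fjk)

# Direct evaluation of the double sum in Equation (4.9), for comparison.
W = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)
fmn_direct = W @ Fjk @ W / N ** 2
```

Going back from the $f_{m,n}$ to the $F_{j,k}$ uses the positive exponent, which in NumPy's convention is `N**2 * numpy.fft.ifft2(fmn)`.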

Calculating the $f_{m,n}$ from the pixel values $F_{j,k}$ is the main operation 
in digital image processing. The $f_{m,n}$ approximate the spatial frequencies 
in the image and modifications to the image, such as smoothing or edge 
enhancement, can be made by modifying the values $f_{m,n}$. Improving the 
resolution of the image can be done by extrapolating the $f_{m,n}$, that is, by 
approximating values of f(x,y) other than x = m and y = n. Once we 
have modified the $f_{m,n}$, we return to the new values of $F_{j,k}$, so calculating 
$F_{j,k}$ from the $f_{m,n}$ is also an important step in image processing. 

In some areas of medical imaging, such as transmission tomography 
and magnetic-resonance imaging, the scanners provide the $f_{m,n}$. Then the 
desired digitized image of the patient is the array $F_{j,k}$. In such cases, the 
$f_{m,n}$ are considered to be approximate values of f(m,n). For more on 
the role of the two-dimensional Fourier transform in medical imaging, see 
Chapter 11 on transmission tomography. 

Even if we managed to have the true values, that is, even if $f_{m,n}$ = 
f(m,n), the values $F_{j,k}$ are not the true values $F(\frac{2\pi}{N}j, \frac{2\pi}{N}k)$. The number 
$F_{j,k}$ is a value of the DFT approximation of $F(\alpha,\beta)$. This DFT
approximation is the function given by 

$$F_{DFT}(\alpha,\beta) = \sum_{m=0}^{N-1}\sum_{n=0}^{N-1} f_{m,n}e^{i\alpha m}e^{i\beta n}.$$

The number $F_{j,k}$ is the value of this approximation at the point $\alpha = \frac{2\pi}{N}j$ 
and $\beta = \frac{2\pi}{N}k$. In other words, 

$$F_{j,k} = F_{DFT}\left(\frac{2\pi}{N}j,\frac{2\pi}{N}k\right),$$

for j,k = 0,1,...,N - 1. How good this discrete image is as an
approximation of the true $F(\alpha,\beta)$ depends primarily on two things: first, how 
accurate an approximation of the numbers f(m,n) the numbers $f_{m,n}$ are; 
and second, how good an approximation of the function $F(\alpha,\beta)$ the
function $F_{DFT}(\alpha,\beta)$ is. 

We can easily see now how important the Fast Fourier Transform algo- 
rithm is. Without the Fast Fourier Transform to accelerate the calculations, 
obtaining a two-dimensional vDFT would be prohibitively expensive. 


4.6 The Issue of Units 


When we write $\cos\pi = -1$, it is with the understanding that $\pi$ is a 
measure of angle, in radians; the function cos will always have an
independent variable in units of radians. Therefore, when we write $\cos(x\omega)$, we 
understand the product $x\omega$ to be in units of radians. If x is measured in 
seconds, then $\omega$ is in units of radians per second; if x is in meters, then $\omega$ is 
in units of radians per meter. When x is in seconds, we sometimes use the 
variable $\frac{\omega}{2\pi}$; since $2\pi$ is then in units of radians per cycle, the variable $\frac{\omega}{2\pi}$ 
is in units of cycles per second, or Hertz. When we sample f(x) at values 
of x spaced $\Delta$ apart, the $\Delta$ is in units of x-units per sample, and the
reciprocal, $\frac{1}{\Delta}$, which is called the sampling frequency, is in units of samples per 
x-units. If x is in seconds, then $\Delta$ is in units of seconds per sample, and $\frac{1}{\Delta}$ 
is in units of samples per second. 
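
The unit conversions above amount to two one-line formulas, sketched here in Python. The function names and the numerical values (a 440-cycle-per-second tone and a spacing of one millisecond) are illustrative assumptions.

```python
import math

def cycles_per_second(omega):
    """Convert omega in radians per second to Hertz: divide by 2*pi radians per cycle."""
    return omega / (2 * math.pi)

def sampling_frequency(delta):
    """Samples per x-unit for a sample spacing delta given in x-units per sample."""
    return 1.0 / delta

tone_hz = cycles_per_second(2 * math.pi * 440.0)   # omega = 2*pi*440 rad/s
rate = sampling_frequency(0.001)                   # one sample every 0.001 seconds
```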




4.7 Approximation, Models, or Truth? 


We mentioned previously that, when we model f(x) using a finite 
Fourier series, we may want to analyze f(x) to determine its sinusoidal 
components. But does f(x) actually contain these sinusoidal components 
in any real sense? An example from Fourier-series expansion will clarify 
this issue. 

Consider the function f(x) = sin x, for $0 \leq x \leq \pi$. The function g(x), 
defined by g(x) = f(x), for $0 \leq x \leq \pi$, and g(x) = f(-x), for $-\pi \leq x \leq 0$, 
can be extended to a continuous even function with period $2\pi$. The Fourier 
series for g(x) is 

$$g(x) = \frac{2}{\pi} - \frac{2}{\pi}\sum_{n=2}^{\infty}\frac{1+\cos n\pi}{n^2-1}\cos nx.$$


When we restrict our attention to x in the interval $[0,\pi]$, we have the 
function sin x expressed as an infinite sum of cosine functions. It is true, 
in a sense, that the sine function on $[0,\pi]$ is made up of infinitely many 
cosines, and any partial sum of this infinite cosine series can be viewed as 
an approximation of the function sin x on $[0,\pi]$. However, is it really the 
kind of truth about the function f(x) that we are seeking? 
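
The claim that partial sums of the cosine series approximate sin x on $[0,\pi]$ can be checked directly. The sketch below assumes the series $g(x) = \frac{2}{\pi} - \frac{2}{\pi}\sum_{n \geq 2}\frac{1+\cos n\pi}{n^2-1}\cos nx$ for the even extension of sin x; the function name and the choice of 200 terms are illustrative.

```python
import numpy as np

def cosine_partial_sum(x, n_max):
    """Partial sum of the cosine series for the even extension of sin x."""
    x = np.asarray(x, dtype=float)
    total = np.full_like(x, 2 / np.pi)
    for n in range(2, n_max + 1):
        # The factor 1 + cos(n*pi) is 2 for even n and 0 for odd n.
        total -= (2 / np.pi) * (1 + np.cos(n * np.pi)) / (n ** 2 - 1) * np.cos(n * x)
    return total

x = np.linspace(0.0, np.pi, 101)
max_error = np.max(np.abs(cosine_partial_sum(x, 200) - np.sin(x)))
```

The approximation is good, yet nothing in the data sin x suggests that these cosines are "really there"; they come from the choice of model.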


4.8 Modeling the Data 


In time-series analysis, we have some unknown function of time, f(t), 
and we measure its values $f(t_n)$ at the N sampling points $t = t_n$, n = 
1,...,N. There are several different possible objectives that we may have 
at this point. 


4.8.1 Extrapolation 


We may want to estimate values of f(t) at points t at which we do not 
have measurements; these other points may represent time in the future, for 
example, and we are trying to predict future values of f(t). In such cases, 
it is common to adopt a model for f(t), which is typically some function of 
t with finitely many as yet undetermined parameters, such as a polynomial 
or a sum of trig functions. We must select our model with care, particularly 
if the data is assumed to be noisy, as most data is. Even though we may 
have a large number of measurements, it may be a mistake to model f(t) 
with as many parameters as we have data. 

We do not really believe that f(t) is a polynomial or a finite Fourier 
series. We may not even believe that the model is a good approximation of 
f(t) for all values of t. We do believe, however, that adopting such a model 
will enable us to carry out our prediction task in a reasonably accurate 
way. The task may be something like predicting the temperature at noon 
tomorrow, on the basis of noon-time temperatures for the previous five 
days. 


4.8.2 Filtering the Data 


Suppose that the values $f(t_n)$ are sampled data from an old recording 
of a singer. We may want to clean up this digitized data, in order to be able 
to recapture the original sound. Now we may only desire to modify each 
of the values $f(t_n)$ in some way, to improve the quality. To perform this 
restoring task, we may model the data as samples of a finite Fourier series, 
or, more generally, as a finite sum of sinusoids in which the frequencies $\gamma_k$ 
are chosen by us. We then solve for the parameters. 

To clean up the sound, we may modify the values of some of the pa- 
rameters. For example, we may believe that certain of the frequencies come 
primarily from a noise component in the recording. To remove, or at least 
diminish, this component, we can reduce the associated coefficients. We 
may feel that the original recording technology failed to capture some of 
the higher notes sung by the soprano. Then we can increase the values of 
those coefficients associated with those frequencies that need to be restored. 
Obviously, restoring old recordings of opera singers is more involved than 
this, but you get the idea. 

The point here is that we need not believe that the entire recording can 
be accurately described, or even approximated, by a finite sum of complex 
exponential functions. But using a finite sum of sinusoids does give another 
way to describe the measured data, and as such, another way to modify 
this data, namely by modifying the coefficients of the sinusoids. We do not 
need to believe that the entire opera can be accurately approximated by 
such a sum in order for this restoring procedure to be helpful. 
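
The idea of cleaning data by modifying coefficients can be sketched with synthetic data. Everything in the example below is made up for illustration: a "signal" at 5 cycles per record, an unwanted component at 60 cycles per record, and a record length of 256. NumPy's FFT uses the opposite sign convention from Equation (4.7), which does not matter here because the coefficients are only scaled and then transformed back.

```python
import numpy as np

N = 256
t = np.arange(N)
clean = np.cos(2 * np.pi * 5 * t / N)          # synthetic component we wish to keep
hum = 0.5 * np.cos(2 * np.pi * 60 * t / N)     # synthetic component attributed to noise
data = clean + hum                             # the "recording"

coeffs = np.fft.fft(data)          # solve for the finite-Fourier-series parameters
coeffs[[60, N - 60]] = 0.0         # reduce the coefficients at the unwanted frequency
restored = np.fft.ifft(coeffs).real
```

In this idealized case the unwanted component sits exactly on two coefficients, so the restoration is exact; with real recordings the separation is never this clean.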

Note that if our goal is to recapture a high note sung by the soprano, we 
do not really need to use samples of the function f(t) that correspond to 
times when only the tenor was on stage singing. It would make more sense 
to process only those measurements taken right around the time the high 
note was sung by the soprano. This is short-time Fourier analysis, an issue 
that we deal with when we discuss time-frequency analysis and wavelets. 


4.9 More on Coherent Summation 


We begin this section with an exercise. 


Ex. 4.3 On a blank sheet of paper, draw a horizontal and vertical axis. 
Starting at the origin, draw a vector with length one unit (a unit can be, 
say, one inch), in an arbitrary direction. Now, from the tip of the first 
vector, draw another vector of length one, again in an arbitrary direction. 
Repeat this process several times, using M vectors in all. Now measure the 
distance from the origin to the tip of the last vector drawn. Compare this 
length with the number M, which would be the distance from the origin to 
the tip of the last vector, if all the vectors had had the same direction. 


This exercise reveals the important difference between coherent and 
incoherent summation, or, if you will, between constructive and destructive 
interference. Each of the unit vectors drawn can be thought of as a complex 
number $e^{i\theta_m}$, where $\theta_m$ is its arbitrary angle. The distance from the origin 
to the tip of the last vector drawn is then 

$$\left|e^{i\theta_1} + e^{i\theta_2} + \cdots + e^{i\theta_M}\right|.$$

If all the angles $\theta_m$ are equal, then this distance is M; in all other cases 
the distance is less than M, and typically much less. The distinction between coherent 
and incoherent summation plays a central role in signal processing, as well 
as in quantum physics, as we discuss briefly in the next section. 
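
The paper-and-pencil experiment of Ex. 4.3 is easy to repeat on a computer. The sketch below adds M unit vectors, first with a common angle and then with angles drawn uniformly at random; the value M = 1000, the common angle 0.7, and the random seed are arbitrary choices.

```python
import numpy as np

def resultant_length(angles):
    """Distance from the origin to the tip of the last of the chained unit vectors."""
    return abs(np.sum(np.exp(1j * np.asarray(angles))))

M = 1000
rng = np.random.default_rng(0)
coherent = resultant_length(np.full(M, 0.7))                # all angles equal: length M
incoherent = resultant_length(rng.uniform(0, 2 * np.pi, M)) # random angles: far shorter
```

For independent, uniformly random angles the resultant length is typically on the order of $\sqrt{M}$ rather than M.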


4.10 Uses in Quantum Electrodynamics 


In his experiments with light, Newton discovered the phenomenon of 
partial reflection. The proportion of the light incident on a glass surface 
that is reflected varies with the thickness of the glass, but the proportion 
oscillates between zero and about sixteen percent as the glass thickens. He 
tried to explain this puzzling behavior, but realized that he had not ob- 
tained a satisfactory explanation. In his beautiful small book “QED: The 
Strange Theory of Light and Matter” [71], the physicist Richard Feynman 
illustrates how the quantum theory applied to light, quantum electrody- 
namics or QED, can be used to unravel many phenomena involving the 
interaction of light with matter, including the partial reflection observed by 
Newton, the least time principle, the array of colors we see on the surface 
of an oily mud puddle, and so on. He is addressing an audience of non- 
physicists, including even some non-scientists, and avoids mathematics as 
much as possible. The one mathematical notion that he uses repeatedly is 
the addition of two-dimensional vectors pointing in a variety of directions, 
that is, coherent and incoherent summation. The vector sum is the proba- 
bility amplitude of the event being discussed, and the square of its length 
is the probability of the event. 


4.11 Using Coherence and Incoherence 


Suppose we are given as data the M complex numbers $d_m = e^{im\gamma}$, for 
m = 1,..., M, and we are asked to find the real number $\gamma$. We can exploit 
coherent summation to get our answer. 

First of all, from the data we have been given, we cannot distinguish $\gamma$ 
from $\gamma + 2\pi$, since, for all integers m, 

$$e^{im(\gamma+2\pi)} = e^{im\gamma}e^{2m\pi i} = e^{im\gamma}(1) = e^{im\gamma}.$$

Therefore, we assume, from the beginning, that the $\gamma$ we want to find lies 
in the interval $[-\pi,\pi)$. Note that we could have selected any interval of 
length $2\pi$, not necessarily $[-\pi,\pi)$; if we have no prior knowledge of where 
$\gamma$ is located, the intervals $[-\pi,\pi)$ or $[0,2\pi)$ are the most obvious choices. 


4.11.1 The Discrete Fourier Transform 


Now we take any value $\omega$ in the interval $[-\pi,\pi)$, multiply each of the 
numbers $d_m$ by $e^{-im\omega}$, and sum over m to get 

$$DFT_d(\omega) = \sum_{m=1}^{M} d_m e^{-im\omega}. \tag{4.11}$$

The sum we denote by $DFT_d$ will be called the discrete Fourier transform 
(DFT) of the data (column) vector $d = (d_1, ..., d_M)^T$. We define the column 
vector $e_\omega$ to be 

$$e_\omega = (e^{i\omega}, e^{2i\omega}, ..., e^{iM\omega})^T,$$

which allows us to write $DFT_d(\omega) = e_\omega^\dagger d$, where the dagger denotes conjugate 
transposition of a matrix or vector. 
Rewriting the exponential terms in the sum in Equation (4.11), we 
obtain 

$$DFT_d(\omega) = \sum_{m=1}^{M} d_m e^{-im\omega} = \sum_{m=1}^{M} e^{im(\gamma-\omega)}.$$

Performing this calculation for each $\omega$ in the interval $[-\pi,\pi)$, we obtain the 
function $DFT_d(\omega)$. For each $\omega$, the complex number $DFT_d(\omega)$ is the sum 
of M complex numbers, each having length one, and angle $\theta_m = m(\gamma-\omega)$. 
So long as $\omega$ is not equal to $\gamma$, these $\theta_m$ are all different, and $DFT_d(\omega)$ is an 
incoherent sum; consequently, $|DFT_d(\omega)|$ will be smaller than M. However, 
when $\omega = \gamma$, each $\theta_m$ equals zero, and $DFT_d(\omega) = |DFT_d(\omega)| = M$; the 
reason for putting the minus sign in the exponent $e^{-im\omega}$ is so that we 
get the term $\gamma - \omega$, which is zero when $\gamma = \omega$. We find the true $\gamma$ by 
computing the value $|DFT_d(\omega)|$ for finitely many values of $\omega$, plotting the 
result, and looking for the highest value. Of course, it may well happen that 
the true value $\omega = \gamma$ is not exactly one of the points we choose to plot; 
it may happen that the true $\gamma$ is half way between two of the plot's grid 
points, for example. Nevertheless, if we know in advance that there is only 
one true $\gamma$, this approach will give us a good idea of its value. 

In many applications, the number M will be quite large, as will be the 
number of grid points we wish to use for the plot. This means that the 
number $DFT_d(\omega)$ is a sum of a large number of terms, and that we must 
calculate this sum for many values of $\omega$. Fortunately, we can use the FFT 
for this. 
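
The search for the peak of $|DFT_d(\omega)|$ can be sketched as follows. The values M = 50 and $\gamma = 0.8$, the grid of 2001 points, and the function name are assumptions made for illustration; the sum is evaluated directly here rather than with the FFT, to keep the correspondence with Equation (4.11) plain.

```python
import numpy as np

def dft_magnitude(d, omegas):
    """|DFT_d(omega)| = |sum_{m=1}^{M} d_m exp(-i*m*omega)| at each omega."""
    m = np.arange(1, len(d) + 1)
    return np.abs(np.exp(-1j * np.outer(omegas, m)) @ d)

M = 50
gamma_true = 0.8                                   # hypothetical value to be recovered
d = np.exp(1j * np.arange(1, M + 1) * gamma_true)  # noise-free data d_m = exp(i*m*gamma)

omegas = np.linspace(-np.pi, np.pi, 2001, endpoint=False)   # plotting grid
magnitudes = dft_magnitude(d, omegas)
gamma_est = omegas[np.argmax(magnitudes)]          # grid point with the highest value
```

The estimate is the grid point nearest the true $\gamma$, so its accuracy is limited by the grid spacing, as the text cautions.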


Ex. 4.4 The Dirichlet kernel of size M is defined as 

$$D_M(x) = \sum_{m=-M}^{M} e^{imx}.$$

Use Equation (4.5) to obtain the closed-form expression 

$$D_M(x) = \frac{\sin\left((M+\frac{1}{2})x\right)}{\sin\left(\frac{x}{2}\right)};$$

note that $D_M(x)$ is real-valued. 


Ex. 4.5 Obtain the closed-form expressions 

$$\sum_{m=N}^{M} \cos mx = \cos\left(\frac{M+N}{2}x\right)\frac{\sin\left(\frac{M-N+1}{2}x\right)}{\sin\frac{x}{2}} \tag{4.12}$$

and 

$$\sum_{m=N}^{M} \sin mx = \sin\left(\frac{M+N}{2}x\right)\frac{\sin\left(\frac{M-N+1}{2}x\right)}{\sin\frac{x}{2}}. \tag{4.13}$$

Hint: Recall that cos mx and sin mx are the real and imaginary parts of 
$e^{imx}$. 


Ex. 4.6 Obtain the formulas in the previous exercise using the trigonomet- 
ric identity 

$$\sin\left(\left(n+\frac{1}{2}\right)x\right) - \sin\left(\left(n-\frac{1}{2}\right)x\right) = 2\sin\left(\frac{x}{2}\right)\cos(nx).$$


Ex. 4.7 Graph the function $D_M(x)$ for various values of M. 


We note in passing that the function $D_M(x)$ equals 2M + 1 for x = 0 
and equals zero for the first time at $x = \frac{2\pi}{2M+1}$. This means that the main 
lobe of $D_M(x)$, the inverted parabola-like portion of the graph centered at 
x = 0, crosses the x-axis at $x = 2\pi/(2M+1)$ and $x = -2\pi/(2M+1)$, so 
its height is 2M + 1 and its width is $4\pi/(2M+1)$. As M grows larger the 
main lobe of $D_M(x)$ gets higher and thinner. 
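
These properties of the Dirichlet kernel can be confirmed numerically. The sketch below assumes the definition $D_M(x) = \sum_{m=-M}^{M} e^{imx}$ and the closed form $\sin((M+\frac{1}{2})x)/\sin(\frac{x}{2})$ from Ex. 4.4; the function names and the choice M = 5 are illustrative.

```python
import numpy as np

def dirichlet_sum(M, x):
    """D_M(x) evaluated from its definition as a sum of 2M + 1 exponentials."""
    m = np.arange(-M, M + 1)
    return np.sum(np.exp(1j * m * x))

def dirichlet_closed(M, x):
    """Closed form of D_M(x); requires sin(x/2) != 0."""
    return np.sin((M + 0.5) * x) / np.sin(x / 2)

M = 5
height = dirichlet_sum(M, 0.0).real                       # 2M + 1 = 11
first_zero = 2 * np.pi / (2 * M + 1)                      # edge of the main lobe
value_at_first_zero = dirichlet_closed(M, first_zero)     # essentially zero
```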

In the exercise that follows we examine the resolving ability of the DFT. 


Suppose we have M equi-spaced samples of a function f(x) having the form 

$$f(x) = e^{ix\gamma_1} + e^{ix\gamma_2},$$

where $\gamma_1$ and $\gamma_2$ are in the interval $(-\pi,\pi)$. If M is sufficiently large, the 
DFT should show two peaks, at roughly the values $\omega = \gamma_1$ and $\omega = \gamma_2$. As 
the distance $|\gamma_2 - \gamma_1|$ grows smaller, it will require a larger value of M for 
the DFT to show two peaks. 


Ex. 4.8 For this exercise, we take $\gamma_1 = -\alpha$ and $\gamma_2 = \alpha$, for some $\alpha$ in the 
interval $(0, \pi)$. Select a value of M that is greater than two and calculate the 
values f(m) for m = 1,..., M. Plot the graph of the function $|DFT_d(\omega)|$ on 
$(-\pi,\pi)$. Repeat the exercise for various values of M and values of $\alpha$ closer 
to zero. Notice how $DFT_d(0)$ behaves as $\alpha$ goes to zero. For each fixed value 
of M there will be a critical value of $\alpha$ such that, for any smaller values of 
$\alpha$, $DFT_d(0)$ will be larger than $DFT_d(\alpha)$. This is loss of resolution. 


4.12 Complications 


In the real world, of course, things are not so simple. In most appli- 
cations, the data comes from measurements, and so contains errors, also 
called noise. The noise terms that appear in each dm are usually viewed as 
random variables, and they may or may not be independent. If the noise 
terms are not independent, we say that we have correlated noise. If we know 
something about the statistics of the noises, we may wish to process the 
data using statistical estimation methods, such as the best linear unbiased 
estimator (BLUE). 
4.12.1 Multiple Signal Components


It sometimes happens that there are two or more distinct values of $\omega$ that we seek. For example, suppose the data is

$$d_m = e^{im\alpha} + e^{im\beta},$$

for $m = 1, ..., M$, where $\alpha$ and $\beta$ are two distinct numbers in the interval $[0, 2\pi)$, and we need to find both $\alpha$ and $\beta$. Now the function $DFT_d(\omega)$ will be

$$DFT_d(\omega) = \sum_{m=1}^{M} \left(e^{im\alpha} + e^{im\beta}\right) e^{-im\omega} = \sum_{m=1}^{M} e^{im\alpha} e^{-im\omega} + \sum_{m=1}^{M} e^{im\beta} e^{-im\omega},$$

so that

$$DFT_d(\omega) = \sum_{m=1}^{M} e^{im(\alpha-\omega)} + \sum_{m=1}^{M} e^{im(\beta-\omega)}.$$

So the function $DFT_d(\omega)$ is the sum of the $DFT_d(\omega)$ that we would have obtained separately if we had had only $\alpha$ and only $\beta$.


4.12.2 Resolution 


If the numbers $\alpha$ and $\beta$ are well separated in the interval $[0, 2\pi)$ or $M$ is very large, the plot of $|DFT_d(\omega)|$ will show two high values, one near $\omega = \alpha$ and one near $\omega = \beta$. However, if $M$ is smaller or $\alpha$ and $\beta$ are too close together, the plot of $|DFT_d(\omega)|$ may show only one broader high bump, centered between $\alpha$ and $\beta$; this is loss of resolution. How close is too close will depend on the value of $M$.


4.12.3 Unequal Amplitudes and Complex Amplitudes 


It is also often the case that the two signal components, the one from $\alpha$ and the one from $\beta$, are not equally strong. We could have

$$d_m = A e^{im\alpha} + B e^{im\beta},$$

where $A > B > 0$. In fact, both $A$ and $B$ could be complex numbers, that is, $A = |A| e^{i\theta_1}$ and $B = |B| e^{i\theta_2}$, so that

$$d_m = |A| e^{i(m\alpha + \theta_1)} + |B| e^{i(m\beta + \theta_2)}.$$

In stochastic signal processing, $A$ and $B$ are viewed as random variables; $A$ and $B$ may or may not be mutually independent.
4.12.4 Phase Errors


It sometimes happens that the hardware that provides the measured data is imperfect and instead of giving us the values $d_m = e^{im\alpha}$, we get $d_m = e^{i(m\alpha + \phi_m)}$. Now each phase error $\phi_m$ depends on $m$, which makes matters worse than when we had $\theta_1$ and $\theta_2$ previously, neither depending on the index $m$.


4.13 Undetermined Exponential Models 


In our previous discussion, we assumed that the frequencies were known 
and only the coefficients needed to be determined. The problem was then 
a linear one. It is sometimes the case that we also want to estimate the 
frequencies from the data. This is computationally more difficult and is a 
nonlinear problem. Prony’s method is one approach to this problem. 

The date of publication of [130] is often taken by editors to be a typo- 
graphical error and is replaced by 1995; or, since it is not written in En- 
glish, perhaps 1895. But the 1795 date is the correct one. The mathematical 
problem Prony solved arises also in signal processing, and his method for 
solving it is still used today. Prony’s method is also the inspiration for the 
eigenvector methods described in Chapter 14. 


4.13.1 Prony’s Problem 


Prony considers a function of the form

$$f(x) = \sum_{n=1}^{N} a_n e^{\gamma_n x}, \qquad (4.14)$$

where we allow the $a_n$ and the $\gamma_n$ to be complex. If we take the $\gamma_n = i\omega_n$ to be imaginary, $f(x)$ becomes the sum of complex exponentials, which we discuss later; if we take $\gamma_n$ to be real, then $f(x)$ is the sum of real exponentials, either increasing or decreasing. The problem is to determine the number $N$, the $\gamma_n$, and the $a_n$ from samples of $f(x)$.


4.13.2 Prony’s Method 


Suppose that we have data $f_m = f(m\Delta)$, for some $\Delta > 0$ and for $m = 1, ..., M$, where we assume that $M = 2N$. We seek a vector $c$ with entries $c_j$, $j = 0, ..., N$ such that

$$c_0 f_{k+1} + c_1 f_{k+2} + c_2 f_{k+3} + \cdots + c_N f_{k+N+1} = 0, \qquad (4.15)$$

for $k = 0, 1, ..., M - N - 1$. So, we want a complex vector $c$ in $\mathbb{C}^{N+1}$ orthogonal to $M - N = N$ other vectors. In matrix-vector notation we are solving the linear system

$$\begin{bmatrix} f_1 & f_2 & \cdots & f_{N+1} \\ f_2 & f_3 & \cdots & f_{N+2} \\ \vdots & \vdots & & \vdots \\ f_N & f_{N+1} & \cdots & f_M \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_N \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

which we write as $Fc = 0$. Since $F^\dagger F c = 0$ also, we see that $c$ is an eigenvector associated with the eigenvalue zero of the hermitian nonnegative definite matrix $F^\dagger F$; here $F^\dagger$ denotes the conjugate transpose of the matrix $F$.

Fix a value of $k$ and replace each of the $f_{k+j+1}$ in Equation (4.15) with the value given by Equation (4.14) to get

$$0 = \sum_{n=1}^{N} a_n \sum_{j=0}^{N} c_j e^{\gamma_n (k+j+1)\Delta} = \sum_{n=1}^{N} a_n e^{\gamma_n (k+1)\Delta} \sum_{j=0}^{N} c_j \left(e^{\gamma_n \Delta}\right)^j.$$

Since this is true for each of the $N$ fixed values of $k$, we conclude that the inner sum is zero for each $n$; that is,

$$\sum_{j=0}^{N} c_j \left(e^{\gamma_n \Delta}\right)^j = 0,$$

for each $n$. Therefore, the polynomial

$$C(z) = \sum_{j=0}^{N} c_j z^j$$

has for its roots the $N$ values $z = e^{\gamma_n \Delta}$. Once we find the roots of this polynomial we have the values of $e^{\gamma_n \Delta}$. If the $\gamma_n$ are real, they are uniquely determined from the values $e^{\gamma_n \Delta}$, whereas, for non-real $\gamma_n$, this is not the case, as we saw when we studied the complex exponential functions.

Then, we obtain the $a_n$ by solving a linear system of equations. In practice we would not know $N$ so would overestimate $N$ somewhat in selecting $M$. As a result, some of the $a_n$ would be zero.
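The steps of Prony's method can be sketched in a few lines of NumPy. This is only an illustration for noise-free data with known $N$; the function name `prony` is hypothetical, and taking $c$ from the singular-value decomposition of $F$ (the right singular vector for the smallest singular value, which is an eigenvector of $F^\dagger F$ for its smallest eigenvalue) is one convenient numerical choice, not necessarily the one a careful implementation would use.

```python
import numpy as np

def prony(f, N, delta):
    """Estimate gamma_n and a_n in f(x) = sum_n a_n exp(gamma_n x) from f_m = f(m*delta), m = 1..M."""
    f = np.asarray(f, dtype=complex)
    M = len(f)
    # Row k of F holds f_{k+1}, ..., f_{k+N+1} (1-based), for k = 0..M-N-1.
    F = np.array([f[k:k + N + 1] for k in range(M - N)])
    # Null vector of F: right singular vector for the smallest singular value.
    _, _, Vh = np.linalg.svd(F)
    c = Vh[-1].conj()
    # Roots of C(z) = sum_j c_j z^j; np.roots expects the highest power first.
    z = np.roots(c[::-1])
    gamma = np.log(z) / delta  # principal branch; unique only for real gamma_n
    # Solve f_m = sum_n a_n z_n^m for the a_n in the least-squares sense.
    m = np.arange(1, M + 1)
    V = z[None, :] ** m[:, None]
    a, *_ = np.linalg.lstsq(V, f, rcond=None)
    return gamma, a

# Illustrative data: f(x) = 2 exp(-0.5 x) + exp(-1.5 x), with N = 2 and M = 4.
delta = 0.5
x = delta * np.arange(1, 5)
gamma, a = prony(2 * np.exp(-0.5 * x) + np.exp(-1.5 * x), N=2, delta=delta)
print(np.round(gamma.real, 6), np.round(a.real, 6))
```

With noisy data the smallest singular value is no longer zero, and the recovered roots can be quite sensitive to the noise, which is the numerical difficulty the text alludes to.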
If we believe that the number $N$ is considerably smaller than $M$, we do not assume that $2N = M$. Instead, we select $L$ somewhat larger than we believe $N$ is and then solve the linear system

$$\begin{bmatrix} f_1 & f_2 & \cdots & f_{L+1} \\ f_2 & f_3 & \cdots & f_{L+2} \\ \vdots & \vdots & & \vdots \\ f_{M-L} & f_{M-L+1} & \cdots & f_M \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_L \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

This system has $M - L$ equations and $L + 1$ unknowns, so is quite over-determined. We would then use the least-squares approach to obtain the vector $c$. Again writing the system as $Fc = 0$, we note that the matrix $F^\dagger F$ is $L+1$ by $L+1$ and has $\lambda = 0$ for its lowest eigenvalue; therefore, it is not invertible. When there is noise in the measurements, this matrix may become invertible, but will still have at least one very small eigenvalue.

Finding the vector c in either case can be tricky because we are look- 
ing for a nonzero solution of a homogeneous system of linear equations. 
For a discussion of the numerical issues involved in these calculations, the 
interested reader should consult the book by Therrien [153]. 
5.1 Chapter Summary


An important example of the use of the DFT is the design of direc- 
tional transmitting or receiving arrays of antennas. In this chapter we re- 
visit transmission and remote sensing, this time with emphasis on the roles 
played by complex exponential functions and the DFT. 


5.2 Directional Transmission 


Parabolic mirrors behind car headlamps reflect the light from the bulb, 
concentrating it directly ahead. Whispering at one focal point of an ellip- 
tical room can be heard clearly at the other focal point. When I call to 
someone across the street, I cup my hands in the form of a megaphone to 
concentrate the sound in that direction. In all these cases the transmitted 
signal has acquired directionality. In the case of the elliptical room, not only 
does the soft whispering reflect off the walls toward the opposite focal point, 
but the travel times are independent of where on the wall the reflections 
occur; otherwise, the differences in time would make the received sound
unintelligible. Parabolic satellite dishes perform much the same function, 
concentrating incoming signals coherently. In this chapter we discuss the 
use of amplitude and phase modulation of transmitted signals to concen- 
trate the signal power in certain directions. Following the lead of Richard 
Feynman in [72], we use radio broadcasting as a concrete example of the 
use of directional transmission. 

Radio broadcasts are meant to be received and the amount of energy 
that reaches the receiver depends on the amount of energy put into the 
transmission as well as on the distance from the transmitter to the receiver. 
If the transmitter broadcasts a spherical wave front, with equal power in all 
directions, the energy in the signal is the same over the spherical wavefronts, 
so that the energy per unit area is proportional to the reciprocal of the surface area of the front. This means that, for omni-directional broadcasting, the energy per unit area, that is, the energy supplied to any receiver, falls off as the reciprocal of the distance squared. The amplitude of the received signal is then proportional to the reciprocal of the distance.

Returning to the example we studied previously, suppose that you own 
a radio station in Los Angeles. Most of the population resides along the 
north-south coast, with fewer to the east, in the desert, and fewer still to 
the west, in the Pacific Ocean. You might well want to transmit the radio 
signal in a way that concentrates most of the power north and south. But 
how can you do this? The answer is to broadcast directionally. By shaping 
the wavefront to have most of its surface area north and south you will have 
the broadcast heard by more people without increasing the total energy in 
the transmission. To achieve this shaping you can use an array of multiple 
antennas. 


5.3 Multiple-Antenna Arrays 
5.3.1 The Array of Equi-Spaced Antennas 


We place $2N + 1$ transmitting antennas a distance $\Delta > 0$ apart along an east-west axis, as shown in Figure 5.1. For convenience, let the locations of the antennas be $n\Delta$, $n = -N, ..., N$. To begin with, let us suppose that we have a fixed frequency $\omega$ and each of the transmitting antennas sends out the same signal $f_n(t) = \frac{1}{\sqrt{2N+1}} \cos(\omega t)$. With this normalization the total energy is independent of $N$. Let $(x, y)$ be an arbitrary location on the ground, and let $s$ be the vector from the origin to the point $(x, y)$. Let $\theta$ be the angle measured clockwise from the positive horizontal axis to the vector $s$. Let $D$ be the distance from $(x, y)$ to the origin. Then, if $(x, y)$ is sufficiently distant from the antennas, the distance from $n\Delta$ on the horizontal axis to $(x, y)$ is approximately $D - n\Delta\cos(\theta)$. The signals arriving at $(x, y)$ from the various antennas will have traveled for different times and so will be out of phase with one another to a degree that depends on the location of $(x, y)$.


FIGURE 5.1: Antenna array and far-field receiver. 


5.3.2 The Far-Field Strength Pattern 


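Since we are concerned only with wavefront shape, we omit for now the distance-dependence in the amplitude of the received signal. The signal received at $(x, y)$ is proportional to

$$f(s, t) = \frac{1}{\sqrt{2N+1}} \sum_{n=-N}^{N} \cos(\omega(t - t_n)),$$

where

$$t_n = \frac{1}{c}\left(D - n\Delta\cos(\theta)\right)$$

and $c$ is the speed of propagation of the signal. Writing

$$\cos(\omega(t - t_n)) = \cos\left(\omega\left(t - \frac{D}{c}\right) + n\gamma\cos(\theta)\right)$$

for $\gamma = \frac{\omega\Delta}{c}$, we have

$$\cos(\omega(t - t_n)) = \cos\left(\omega\left(t - \frac{D}{c}\right)\right)\cos(n\gamma\cos(\theta)) - \sin\left(\omega\left(t - \frac{D}{c}\right)\right)\sin(n\gamma\cos(\theta)).$$

Using Equations (4.12) and (4.13), we find that the signal received at $(x, y)$ is

$$f(s, t) = \frac{1}{\sqrt{2N+1}} H(\theta) \cos\left(\omega\left(t - \frac{D}{c}\right)\right) \qquad (5.1)$$

for

$$H(\theta) = \frac{\sin\left((N + \frac{1}{2})\gamma\cos(\theta)\right)}{\sin\left(\frac{1}{2}\gamma\cos(\theta)\right)};$$

when the denominator equals zero the signal equals $\sqrt{2N+1}\cos\left(\omega\left(t - \frac{D}{c}\right)\right)$.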


5.3.3 Can the Strength Be Zero? 


We see from Equation (5.1) that the maximum power is in the north-south direction. What about the east-west direction? In order to have negligible signal power wasted in the east-west direction, we want the numerator, but not the denominator, in Equation (5.1) to be zero when $\theta = 0$. This means that $\Delta = m\lambda/(2N + 1)$, where $\lambda = 2\pi c/\omega$ is the wavelength and $m$ is some positive integer less than $2N + 1$. Recall that the wavelength for broadcast radio is tens to hundreds of meters.
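The condition for an east-west null can be checked numerically. The sketch below evaluates $H(\theta)$ from Equation (5.1), using $\gamma = \omega\Delta/c = 2\pi\Delta/\lambda$; the function name `pattern_H` and the values of $N$, $m$, and $\lambda$ are illustrative assumptions.

```python
import numpy as np

def pattern_H(theta, N, spacing, wavelength):
    """H(theta) = sin((N + 1/2) g cos(theta)) / sin((1/2) g cos(theta)), with g = 2 pi spacing / wavelength."""
    gamma = 2 * np.pi * spacing / wavelength  # equals omega * Delta / c
    x = gamma * np.cos(np.asarray(theta, dtype=float))
    num = np.sin((N + 0.5) * x)
    den = np.sin(0.5 * x)
    # Where the denominator vanishes, H takes its limiting value 2N + 1.
    safe = np.abs(den) > 1e-12
    return np.where(safe, num / np.where(safe, den, 1.0), 2 * N + 1.0)

N, m, wavelength = 5, 1, 1.0
spacing = m * wavelength / (2 * N + 1)  # Delta = m * lambda / (2N + 1)
print(pattern_H([0.0, np.pi / 2], N, spacing, wavelength))  # east, north
```

With this spacing the pattern is essentially zero toward the east ($\theta = 0$) and attains its maximum $2N + 1$ toward the north ($\theta = \pi/2$). Passing an array of angles to `pattern_H` gives the values needed for a polar plot.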


Ex. 5.1 Graph the function $H(\theta)$ in polar coordinates for various choices of $N$ and $\Delta$.


Figures 5.2, 5.3, and 5.4 show the transmission pattern $H(\theta)$ for various choices of $m$ and $N$. In Figure 5.2 $N = 5$ for each plot and $m$ changes, illustrating the effect of changing the spacing of the array elements. The plots in Figure 5.3 differ from those in Figure 5.2 only in that $N = 21$ now. In Figure 5.4 we allow $m$ to be less than one, showing the loss of the nulls in the east and west directions.
FIGURE 5.2: Transmission Pattern $H(\theta)$: $m = 1, 2, 4, 8$ and $N = 5$.


FIGURE 5.3: Transmission Pattern $H(\theta)$: $m = 1, 2, 4, 8$ and $N = 21$.


FIGURE 5.4: Transmission Pattern $H(\theta)$: $m = 0.9, 0.5, 0.25, 0.125$ and $N = 21$.
5.3.4 Diffraction Gratings


I have just placed on the table next to me a CD, with the shinier side 
up. Beyond it is a lamp. The CD acts as a mirror, and I see in the CD 
the reflection of the lamp. Every point of the lamp seems to be copied in 
a particular point on the surface of the CD, as if the ambient light that 
illuminates a particular point of the lamp travels only to a single point on 
the CD and then is reflected on into my eye. Each point of the lamp has its 
own special point on the CD. We know from basic optics that that point 
is such that the angle of incidence equals the angle of reflection, and the 
path (apparently) taken by the light beam is the shortest path the light 
can take to get from the lamp to the CD and then on to my eye. But how 
does the light know where to go? 

In fact, what happens is that light beams take many paths from each 
particular point on the lamp to the CD and on to my eye. The reason I 
see only the one path is that all the other paths require different travel 
times, and so light beams on different paths arrive at my eye out of phase 
with one another. Only those paths very close to the one I see have travel 
times sufficiently similar to avoid this destructive interference. Speaking 
a bit more mathematically, if we define the function that associates with 
each path the time to travel along that path, then, at the shortest path, the 
first derivative of this function, in the sense of the calculus of variations, 
is zero. Therefore deviations from the shortest path correspond only to 
second-order changes in travel time, not first-order ones, which reduces the 
destructive interference. 

But, as I look at the CD on the table, I see more than the reflection 
of the lamp. I see streaks of color also. There is a window off to the side 
and the sun is shining into the room through this window. When I place 
my hand between the CD and the window, some of the colored streaks 
disappear, and other colored streaks seem to appear. I am not seeing a 
direct reflection of the sun; it is off to the side. What is happening is that 
the grooves on the surface of the CD are each reflecting sunlight and acting 
as little transmitters. Each color in the spectrum corresponds to a particular 
frequency w of light and at just the proper angle the spacing between the 
grooves on the CD leads to coherent transmission of the reflected light in 
the direction of my eye. The combination of frequency and spacing between 
the grooves determines what color I see and at what angle. When I reach 
over and tilt the CD off the table, the colors of the streaks change, because 
I have changed the spacing of the little transmitters, relative to my eye. 
An arrangement like this is called a diffraction grating and has many uses 
in physics. For a wonderful, and largely math-free, introduction to these 
ideas, see the book by Feynman [71]. 
5.4 Phase and Amplitude Modulation


In the previous section the signal broadcast from each of the antennas was the same. Now we look at what directionality can be obtained by using different amplitudes and phases at each of the antennas. Let the signal broadcast from the antenna at $n\Delta$ be

$$f_n(t) = |A_n| \cos(\omega t - \phi_n) = |A_n| \cos(\omega(t - \tau_n)),$$

for some amplitude $|A_n| > 0$ and phase $\phi_n = \omega\tau_n$. Now the signal received at $s$ is proportional to

$$f(s, t) = \sum_{n=-N}^{N} |A_n| \cos(\omega(t - t_n - \tau_n)).$$

If we wish, we can repeat the calculations done earlier to see what the effect of the amplitude and phase changes is. Using complex notation simplifies things somewhat.

Let us consider a complex signal; suppose that the signal transmitted from the antenna at $n\Delta$ is $g_n(t) = |A_n| e^{i\omega(t - \tau_n)}$. Then, the signal received at location $s$ is proportional to

$$g(s, t) = \sum_{n=-N}^{N} |A_n| e^{i\omega(t - t_n - \tau_n)}.$$

Then we have

$$g(s, t) = B(\theta) e^{i\omega(t - \frac{D}{c})}$$

for $A_n = |A_n| e^{-i\phi_n}$, $x = \frac{\omega\Delta}{c}\cos(\theta)$, and

$$B(\theta) = \sum_{n=-N}^{N} A_n e^{inx}.$$

Note that the complex amplitude function $B(\theta)$ depends on our choices of $N$ and $\Delta$ and takes the form of a finite Fourier series or DFT. We can design $B(\theta)$ to approximate the desired directionality by choosing the appropriate complex coefficients $A_n$ and selecting the amplitudes $|A_n|$ and phases $\phi_n$ accordingly. We can generalize further by allowing the antennas to be spaced irregularly along the east-west axis, or even distributed irregularly over a two-dimensional area on the ground.
5.5 Steering the Array


In our previous discussion, we selected $A_n = 1$ and $\phi_n = 0$ for all $n$ and saw that the maximum transmitted power was along the north-to-south axis. Suppose that we want to design a transmitting array that maximally concentrates signal power in another direction. Theoretically, we could physically rotate or steer the array until it ran along a different axis, and then proceed as before, with $A_n = 1$ and $\phi_n = 0$. This is not practical, in most cases. There is an alternative, fortunately. We can “steer” the array mathematically.

If $|A_n| = 1$, and

$$\phi_n = \frac{n\Delta\omega}{c}\cos\alpha,$$

for some angle $\alpha$, then, for $x = \frac{\omega\Delta}{c}\cos(\theta)$, we have

$$B(\theta) = \sum_{n=-N}^{N} e^{inx} e^{-i\phi_n} = \sum_{n=-N}^{N} e^{in\frac{\Delta\omega}{c}(\cos\theta - \cos\alpha)}.$$

The maximum absolute value of $B(\theta)$ occurs when $\cos\theta = \cos\alpha$, or when $\theta = \alpha$ or $\theta = -\alpha$. Now the greatest power is concentrated in these directions. The point here is that we have altered the directionality of the transmission, not by physically moving the array of antennas, but by changing the phases of the transmitted signals. This approach is sometimes called phase steering. The same basic idea applies when we are receiving signals, rather than sending them. In radar and sonar, the array of sensors is steered mathematically, by modifying the phases of the measured data, to focus the sensitivity of the detecting array in a particular direction.
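Phase steering is easy to try out numerically. The following sketch evaluates $|B(\theta)|$ for unit amplitudes and the steering phases $\phi_n = n\frac{\Delta\omega}{c}\cos\alpha$; the function name `steered_pattern`, the half-wavelength spacing ($\omega\Delta/c = \pi$), and the steering angle $\alpha = \pi/3$ are assumptions made for illustration.

```python
import numpy as np

def steered_pattern(theta, N, gamma, alpha):
    """|B(theta)| for |A_n| = 1 and phi_n = n * gamma * cos(alpha), where gamma = omega * Delta / c."""
    n = np.arange(-N, N + 1)
    theta = np.asarray(theta, dtype=float)
    phase = gamma * (np.cos(theta)[:, None] - np.cos(alpha))
    return np.abs(np.exp(1j * n[None, :] * phase).sum(axis=1))

theta = np.linspace(0.0, np.pi, 721)
B = steered_pattern(theta, N=10, gamma=np.pi, alpha=np.pi / 3)
peak = theta[np.argmax(B)]
print(f"peak near theta = {peak:.4f}, requested alpha = {np.pi / 3:.4f}")
```

Changing `alpha` moves the main lobe without changing the antenna positions, which is the point of the section.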


5.6 Maximal Concentration in a Sector 


In this section we take $\Delta = \frac{\pi c}{\omega}$, so that $\frac{\omega\Delta}{c} = \pi$. Suppose that we want to concentrate the transmitted power in the directions $\theta$ corresponding to $x = \frac{\omega\Delta}{c}\cos(\theta)$ in the sub-interval $[a, b]$ of the interval $[-\frac{\omega\Delta}{c}, \frac{\omega\Delta}{c}]$. Let $u = (A_{-N}, ..., A_N)^T$ be the vector of coefficients for the function

$$B(x) = \sum_{n=-N}^{N} A_n e^{-inx}.$$

We want $|B(x)|$ to be concentrated in the interval $a \le x \le b$.
Ex. 5.2 Show that

$$\frac{1}{2\pi} \int_{-\frac{\omega\Delta}{c}}^{\frac{\omega\Delta}{c}} |B(x)|^2 \, dx = u^\dagger u,$$

and

$$\frac{1}{2\pi} \int_{a}^{b} |B(x)|^2 \, dx = u^\dagger Q u,$$

where $Q$ is the matrix with entries

$$Q_{mn} = \frac{1}{2\pi} \int_{a}^{b} \exp(i(m - n)x) \, dx.$$

Maximizing the concentration of power within the interval $[a, b]$ is then equivalent to finding the vector $u$ that maximizes the ratio $u^\dagger Q u / u^\dagger u$. The matrix $Q$ is positive-definite, all its eigenvalues are positive, and the optimal $u$ is the eigenvector of $Q$ associated with the largest eigenvalue. This largest eigenvalue is the desired ratio and is always less than one. As $N$ increases this ratio approaches one, for any fixed sub-interval $[a, b]$.
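The eigenvalue problem above can be solved directly for small $N$. The sketch below builds $Q$ from the closed form of its entries, $Q_{mn} = \frac{e^{i(m-n)b} - e^{i(m-n)a}}{2\pi i (m-n)}$ for $m \ne n$ and $Q_{mm} = \frac{b-a}{2\pi}$, and extracts the largest eigenvalue; the function names and the interval $[-0.5, 0.5]$ are illustrative assumptions.

```python
import numpy as np

def concentration_matrix(N, a, b):
    """Q[m, n] = (1 / 2 pi) * integral from a to b of exp(i (m - n) x) dx, for m, n = -N..N."""
    idx = np.arange(-N, N + 1)
    k = idx[:, None] - idx[None, :]      # m - n
    k_safe = np.where(k == 0, 1, k)      # placeholder to avoid dividing by zero on the diagonal
    off = (np.exp(1j * k * b) - np.exp(1j * k * a)) / (2j * np.pi * k_safe)
    return np.where(k == 0, (b - a) / (2 * np.pi), off)

def best_concentration(N, a, b):
    """Largest eigenvalue of Q and the corresponding coefficient vector u."""
    eigvals, eigvecs = np.linalg.eigh(concentration_matrix(N, a, b))  # ascending order
    return eigvals[-1], eigvecs[:, -1]

for N in (2, 4, 8):
    ratio, u = best_concentration(N, -0.5, 0.5)
    print(f"N={N}: fraction of power in [a, b] = {ratio:.4f}")
```

The printed ratios stay below one and increase with $N$, in line with the statements above.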


5.7 Scattering in Crystallography 


When x-rays are passed through certain materials they are scattered, 
which means retransmitted in various directions. As W. L. Bragg discov- 
ered, by analyzing the distinctive pattern of the scattering the molecular 
structure of the material can be determined. This technique was used by 
Rosalind Franklin, a physicist at King’s College, London, to analyze DNA 
and her work contributed greatly to the discovery, by Francis Crick and 
James Watson, of the double-helix structure of that molecule. 

In 1964 the British scientist Dorothy Hodgkin won the Nobel Prize for 
her extension of this technique to reveal the structure of compounds more 
complex than any previously analyzed. Her most important work was on 
the structure of cholesterol, vitamin D, penicillin, vitamin B12, and insulin, 
where she was able to uncover, by physical methods, chemical features not 
encountered before, and thereby to extend the bounds of chemistry itself. 
One of Dorothy Hodgkin’s students at Oxford was Margaret Roberts, later 
Margaret Thatcher, Prime Minister of Great Britain throughout the 1980’s. 

In [101] Körner reveals how surprised he was when he heard that
large amounts of computer time are spent by crystallographers computing 
Fourier transforms numerically. He goes on to describe this application. 
The structure to be analyzed consists of some finite number of particles that will scatter in all directions any electromagnetic radiation that hits them. A beam of monochromatic light with unit strength and frequency $\omega$ is sent into the structure and the resulting scattered beams are measured at some number of observation points.

We say that the scattering particles are located in space at the points $r_m$, $m = 1, ..., M$, and that the incoming light arrives as a planewave with wavevector $k_0$. Then the planewave field generated by the incoming light is

$$g(s, t) = e^{i\omega t} e^{i k_0 \cdot s}.$$

What is received at each $r_m$ is then

$$g(r_m, t) = e^{i\omega t} e^{i k_0 \cdot r_m}.$$

We observe the scattered signals at $s$, where the retransmitted signal coming from $r_m$ is

$$f(s, t) = e^{i\omega t} e^{i k_0 \cdot r_m} e^{i\|s - r_m\|}.$$


When $s$ is sufficiently remote from the scattering particles, the retransmitted signal from $r_m$ arrives at $s$ as a planewave with wavevector

$$k_m = (s - r_m)/\|s - r_m\|.$$

Therefore, at $s$ we receive

$$u(s, t) = e^{i\omega t} \sum_{m=1}^{M} e^{i k_m \cdot s}.$$


The objective is to determine the $k_m$, which will then tell us the locations $r_m$ of the scattering particles. To do this, we imagine an infinity of possible locations $r$ for the particles and define $a(r) = 1$ if $r = r_m$ for some $m$, and $a(r) = 0$ otherwise. More precisely, we define $a(r)$ as a sum of unit-strength Dirac delta functions supported at the $r_m$, a topic we shall deal with later. At each $r$ we obtain (in theory) a value of the function $A(k)$, the Fourier transform of the function $a(r)$.

In practice, the crystallographers cannot measure the complex numbers 
A(k), but only the magnitudes |A(k)|; the phase angle of A(k) is lost. This 
presents the crystallographers with the phase problem, in which we must 
estimate a function from values of the magnitude of its Fourier transform. 
For a detailed discussion of the phase problem see Chapter 10. 

In 1985, Hauptman and Karle won the Nobel Prize in Chemistry for 
developing a new method for finding a(s) from measurements. Their tech- 
nique is highly mathematical. It is comforting to know that, although there 
is no Nobel Prize in Mathematics, it is still possible to win the prize for 
doing mathematics. 


Chapter 6 


The Fourier Transform and 
Convolution Filtering 


6.1 Chapter Summary
6.2 Linear Filters
6.3 Shift-Invariant Filters
6.4 Some Properties of a SILO
6.5 The Dirac Delta
6.6 The Impulse-Response Function
6.7 Using the Impulse-Response Function
6.8 The Filter Transfer Function
6.9 The Multiplication Theorem for Convolution
6.10 Summing Up
6.11 A Question
6.12 Band-Limiting

6.1 Chapter Summary 


A major application of the Fourier transform is in the study of systems. We may think of a system as a device that accepts functions as input and produces functions as output. For example, the differentiation system accepts a differentiable function $f(x)$ as input and produces its derivative function $f'(x)$ as output. If the input is the function $f(x) = 5f_1(x) + 3f_2(x)$, then the output is $5f_1'(x) + 3f_2'(x)$; the differentiation system is linear. We shall describe systems algebraically by $h = Tf$, where $f$ is any input function, $h$ is the resulting output function from the system, and $T$ is the operator that represents the operation performed by the system on any input. For the differentiation system we would write the differentiation operator as $Tf = f'$.
6.2 Linear Filters


The system operator $T$ is linear if

$$T(af_1 + bf_2) = aT(f_1) + bT(f_2),$$

for any scalars $a$ and $b$ and functions $f_1$ and $f_2$. We shall be interested only in linear systems.


6.3 Shift-Invariant Filters 


We denote by $S_a$ the system that shifts an input function by $a$; that is, if $f(x)$ is the input to system $S_a$, then $f(x - a)$ is the output. A system operator $T$ is said to be shift-invariant if

$$T(S_a(f)) = S_a(T(f)),$$

which means that, if input $f(x)$ leads to output $h(x)$, then input $f(x - a)$ leads to output $h(x - a)$; shifting the input just shifts the output. When the variable $x$ is time, we speak of time-invariant systems. When $T$ is a shift-invariant linear system operator we say that $T$ is a SILO.


6.4 Some Properties of a SILO 


We show first that $(Tf)' = Tf'$. Suppose that $h(x) = (Tf)(x)$. For any $\Delta x$ we can write

$$f(x + \Delta x) = (S_{-\Delta x} f)(x)$$

and

$$(T S_{-\Delta x} f)(x) = (S_{-\Delta x} T f)(x) = (S_{-\Delta x} h)(x) = h(x + \Delta x).$$

When the input to the system is

$$\frac{1}{\Delta x}\left(f(x + \Delta x) - f(x)\right),$$

the output is

$$\frac{1}{\Delta x}\left(h(x + \Delta x) - h(x)\right).$$

Now we take limits, as $\Delta x \to 0$, so that, assuming continuity, we can conclude that $Tf' = h'$. We apply this now to the case in which $f(x) = e^{-ix\omega}$ for some real constant $\omega$.

Since $f'(x) = -i\omega f(x)$ and $f(x) = \frac{i}{\omega} f'(x)$ in this case, we have

$$h(x) = (Tf)(x) = \frac{i}{\omega}(Tf')(x) = \frac{i}{\omega} h'(x),$$

so that

$$h'(x) = -i\omega h(x).$$

Solving this differential equation, we obtain

$$h(x) = c e^{-ix\omega},$$

for some constant $c$. Note that since $c$ may vary when we vary the selected $\omega$, we must write $c = c(\omega)$. The main point here is that, when $T$ is a SILO and the input function is a complex exponential with frequency $\omega$, then the output is again a complex exponential with the same frequency $\omega$, multiplied by a complex number $c(\omega)$. This multiplication by $c(\omega)$ only modifies the amplitude and phase of the exponential function; it does not alter its frequency. So SILOs do not change the input frequencies, but only modify their strengths and phases.


Ex. 6.1 Let $T$ be a SILO. Show that $T$ is a convolution operator by showing that, for each input function $f$, the output function $h = Tf$ is the convolution of $f$ with $g$, where $g(x)$ is the inverse FT of the function $c(\omega)$ obtained above. Hint: Write the input function $f(x)$ as

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega) e^{-ix\omega} \, d\omega,$$

and assume that

$$(Tf)(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega) (T e^{-ix\omega})(x) \, d\omega.$$


Now that we know that a SILO is a convolution filter, the obvious question to ask is: What is $g(x)$? This is the system identification problem. One way to solve this problem is to consider what the output is when the input is the Heaviside function $u(x)$. In that case, we have

$$h(x) = \int_{-\infty}^{\infty} u(y) g(x - y) \, dy = \int_{0}^{\infty} g(x - y) \, dy = \int_{-\infty}^{x} g(t) \, dt.$$

Therefore, $h'(x) = g(x)$.
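The relation $h'(x) = g(x)$ between the step response and $g$ can be checked on a grid. In this sketch the Gaussian $g(t) = e^{-t^2}$ is an assumed stand-in for an unknown impulse response, and the integral and derivative are replaced by a running sum and a finite difference.

```python
import numpy as np

t = np.linspace(-8.0, 8.0, 3201)
dt = t[1] - t[0]
g = np.exp(-t**2)                 # assumed impulse response, for illustration only
h = np.cumsum(g) * dt             # step response: h(x) ~ integral of g from -infinity to x
h_prime = np.gradient(h, dt)      # finite-difference derivative of the step response

print("max |h' - g| on the grid:", np.max(np.abs(h_prime - g)))
```

The discrepancy shrinks as the grid is refined, so measuring the response to a step input and differentiating it is one way to estimate $g$.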
6.5 The Dirac Delta


The Dirac delta, denoted $\delta(x)$, is not truly a function. Its job is best described by its sifting property: for any fixed value of $x$,

$$f(x) = \int f(y) \delta(x - y) \, dy.$$

In order for the Dirac delta to perform the sifting operation on any $f(x)$ it would have to be zero, except at $x = 0$, where it would have to be infinitely large. It is possible to give a rigorous treatment of the Dirac delta, using generalized functions, but that is beyond the scope of this course. The Dirac delta is useful in our discussion of filters, which is why it is used.


6.6 The Impulse-Response Function 


We can solve the system identification problem by seeing what the output is when the input is the Dirac delta; as we shall see, the output is $g(x)$; that is, $T\delta = g$. Since the SILO $T$ is a convolution operator, we know that

$$h(x) = \int \delta(y) g(x - y) \, dy = g(x).$$

For this reason, the function $g(x)$ is called the impulse-response function of the system.


6.7 Using the Impulse-Response Function 


Suppose now that we take as our input the function $f(x)$, but write it as

$$f(x) = \int f(y) \delta(x - y) \, dy.$$

Then, since $T$ is linear, and the integral is more or less a big sum, we have

$$T(f)(x) = \int f(y) T(\delta(x - y)) \, dy = \int f(y) g(x - y) \, dy.$$

The function on the right side of this equation is the convolution of the functions $f$ and $g$, written $f * g$. This shows, as we have seen, that $T$ does its job by convolving any input function $f$ with its impulse-response function $g$, to get the output function $h = Tf = f * g$. It is useful to remember that order does not matter in convolution:

$$\int f(y) g(x - y) \, dy = \int g(y) f(x - y) \, dy.$$


6.8 The Filter Transfer Function 


Now let us take as input the complex exponential $f(x) = e^{-ix\omega}$, where $\omega$ is fixed. Then the output is

$$h(x) = T(f)(x) = \int e^{-iy\omega} g(x - y) \, dy = \int g(y) e^{-i(x - y)\omega} \, dy = G(\omega) e^{-ix\omega},$$

where $G(\omega)$ is the Fourier transform of the impulse-response function $g(x)$; note that $G(\omega) = c(\omega)$ from Exercise 6.1. This tells us that when the input to $T$ is a complex exponential function with “frequency” $\omega$, the output is the same complex exponential function, with the “frequency” unchanged, but multiplied by a complex number $G(\omega)$. This multiplication by $G(\omega)$ can change both the amplitude and phase of the complex exponential, but the “frequency” $\omega$ does not change. In filtering, this function $G(\omega)$ is called the transfer function of the filter, or sometimes the frequency-response function.
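A quick numerical check of this statement is possible with Riemann sums. In the sketch below the Gaussian $g(y) = e^{-y^2/2}$ is an assumed impulse response, for which $G(\omega) = \int g(y) e^{iy\omega} \, dy = \sqrt{2\pi}\, e^{-\omega^2/2}$; the function names and the values of $x$ and $\omega$ are illustrative.

```python
import numpy as np

y = np.linspace(-10.0, 10.0, 4001)   # integration grid
dy = y[1] - y[0]
g = np.exp(-y**2 / 2)                # assumed impulse response

def output_at(x, omega):
    """h(x) = integral of g(y) exp(-i (x - y) omega) dy, approximated by a Riemann sum."""
    return np.sum(g * np.exp(-1j * (x - y) * omega)) * dy

def transfer(omega):
    """G(omega) = integral of g(y) exp(i y omega) dy, approximated by the same Riemann sum."""
    return np.sum(g * np.exp(1j * y * omega)) * dy

x, omega = 0.7, 1.3
print(output_at(x, omega), transfer(omega) * np.exp(-1j * x * omega))
```

The two printed numbers agree: the filter has only rescaled the input exponential by the complex factor $G(\omega)$.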


6.9 The Multiplication Theorem for Convolution 


Now let’s take as input a function $f(x)$, but now write it, using Equation (2.7), as

$$f(x) = \frac{1}{2\pi} \int F(\omega) e^{-ix\omega} \, d\omega.$$

Then, taking the operator inside the integral, we find that the output is

$$h(x) = T(f)(x) = \frac{1}{2\pi} \int F(\omega) T(e^{-ix\omega}) \, d\omega = \frac{1}{2\pi} \int e^{-ix\omega} F(\omega) G(\omega) \, d\omega.$$

But, from Equation (2.7), we know that

$$h(x) = \frac{1}{2\pi} \int H(\omega) e^{-ix\omega} \, d\omega.$$

This tells us that the Fourier transform $H(\omega)$ of the function $h = f * g$ is simply the product of $F(\omega)$ and $G(\omega)$; this is the most important property of convolution.
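The same property holds for finite sequences, where it can be verified exactly. The sketch below uses the periodic (circular) convolution and NumPy's FFT as a discrete analogue of the theorem; this is an illustration rather than the continuous statement itself, and NumPy's FFT uses the opposite sign in the exponent from the convention in this text, which does not affect the product rule.

```python
import numpy as np

def circular_convolution(f, g):
    """h[n] = sum over m of f[m] * g[(n - m) mod L], the periodic analogue of f * g."""
    L = len(f)
    return np.array([sum(f[m] * g[(n - m) % L] for m in range(L)) for n in range(L)])

rng = np.random.default_rng(0)
f = rng.standard_normal(32)
g = rng.standard_normal(32)
h = circular_convolution(f, g)

lhs = np.fft.fft(h)                   # transform of the convolution
rhs = np.fft.fft(f) * np.fft.fft(g)   # product of the transforms
print(np.allclose(lhs, rhs))
```

This identity is also why convolutions of long sequences are usually computed by multiplying transforms rather than by the double loop above.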


6.10 Summing Up 


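It is helpful to take stock of what we have just discovered:

1. if $h = T(f)$ then $h' = T(f')$;

2. $T(e^{-ix\omega}) = G(\omega) e^{-ix\omega}$;

3. writing

$$f(x) = \frac{1}{2\pi} \int F(\omega) e^{-ix\omega} \, d\omega,$$

we obtain

$$h(x) = (Tf)(x) = \frac{1}{2\pi} \int F(\omega) T(e^{-ix\omega}) \, d\omega,$$

so that

$$h(x) = \frac{1}{2\pi} \int F(\omega) G(\omega) e^{-ix\omega} \, d\omega;$$

4. since we also have

$$h(x) = \frac{1}{2\pi} \int H(\omega) e^{-ix\omega} \, d\omega,$$

we can conclude that $H(\omega) = F(\omega) G(\omega)$;

5. if we define $g(x)$ to be $(T\delta)(x)$, then

$$g(x - y) = (T\delta)(x - y).$$

Writing

$$f(x) = \int f(y) \delta(x - y) \, dy,$$

we get

$$h(x) = (Tf)(x) = \int f(y) (T\delta)(x - y) \, dy = \int f(y) g(x - y) \, dy,$$

so that $h$ is the convolution of $f$ and $g$;

6. $g(x)$ is the inverse Fourier transform of $G(\omega)$.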
6.11 A Question


Previously, we allowed the operator T to move inside the integral. We 
know, however, that this is not always permissible. The differentiation op- 
erator T = D, with D(f) = f’, cannot always be moved inside the integral; 
as we learn in advanced calculus, we cannot always differentiate under the 
integral sign. This raises the interesting issue of how to represent the differ- 
entiation operator as a shift-invariant linear filter. In particular, what is the 
impulse-response function? The answer will involve the problem of differ- 
entiating the delta function, the Green’s Function method for representing 
the inversion of linear differential operators, and generalized functions or 
distributions. 


6.12 Band-Limiting 


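Suppose that $G(\omega) = \chi_\Omega(\omega)$, the characteristic function of the interval 
$[-\Omega, \Omega]$. Then, if F(ω) is the Fourier transform of the input function, 
the Fourier transform of the output function h(t) will be 

$$ H(\omega) = \begin{cases} F(\omega), & \text{if } |\omega| \le \Omega; \\ 0, & \text{if } |\omega| > \Omega. \end{cases} $$

The effect of the filter is to leave values F(ω) unchanged, if |ω| ≤ Ω, and to 
replace F(ω) with zero, if |ω| > Ω. This is called band-limiting. Since the 
inverse Fourier transform of G(ω) is 

$$ g(t) = \frac{\sin(\Omega t)}{\pi t}, $$

the band-limiting system can be described using convolution: 

$$ h(t) = \int f(s)\,\frac{\sin(\Omega(t - s))}{\pi(t - s)}\,ds. $$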
7.1 Chapter Summary 


Many textbooks on signal processing present filters in the context of 
infinite sequences. Although infinite sequences are no more realistic than 
functions f(t) defined for all times t, they do simplify somewhat the discus- 
sion of filtering, particularly when it comes to the impulse response and to 
random signals. Systems that have as input and output infinite sequences 
are called discrete systems. 


7.2 Shifting 


We denote by $f = \{f_n\}_{n=-\infty}^{\infty}$ an infinite sequence. For a fixed integer 
k, the system that accepts f as input and produces as output the shifted 
sequence $h = \{h_n = f_{n-k}\}$ is denoted $S_k$; therefore, we write $h = S_k f$. 


7.3 Shift-Invariant Discrete Linear Systems 


A discrete system T is linear if 

$$ T(af^1 + bf^2) = aT(f^1) + bT(f^2), $$

for any infinite sequences $f^1$ and $f^2$ and scalars a and b. As previously, a 
system T is shift-invariant if $TS_k = S_kT$. This means that if input f has 
output h, then input $S_k f$ has output $S_k h$; shifting the input by k just shifts 
the output by k. 


7.4 The Delta Sequence 


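The delta sequence $\delta = \{\delta_n\}$ has $\delta_0 = 1$ and $\delta_n = 0$, for n not equal to 
zero. Then $S_k(\delta)$ is the sequence $S_k(\delta) = \{\delta_{n-k}\}$. For any sequence f we 
have 

$$ f_n = \sum_{m=-\infty}^{\infty} f_m\delta_{n-m} = \sum_{m=-\infty}^{\infty} \delta_m f_{n-m}. $$

This means that we can write the sequence f as an infinite sum of the 
sequences $S_m\delta$: 

$$ f = \sum_{m=-\infty}^{\infty} f_m S_m(\delta). \tag{7.1} $$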


As in the continuous case, we use the delta sequence to understand better 
how a shift-invariant discrete linear system T works. 


7.5 The Discrete Impulse Response 


We let δ be the input to the shift-invariant discrete linear system T, 
and denote the output sequence by g = T(δ). Now, for any input sequence 
f with h = T(f), we write f using Equation (7.1), so that 

$$ h = T(f) = T\Big(\sum_{m=-\infty}^{\infty} f_m S_m(\delta)\Big) = \sum_{m=-\infty}^{\infty} f_m TS_m(\delta) = \sum_{m=-\infty}^{\infty} f_m S_m T(\delta) = \sum_{m=-\infty}^{\infty} f_m S_m(g). $$

Therefore, we have 

$$ h_n = \sum_{m=-\infty}^{\infty} f_m g_{n-m}, \tag{7.2} $$

for each n. Equation (7.2) is the definition of discrete convolution or the 
convolution of sequences. This tells us that the output sequence h = T(f) is 
the convolution of the input sequence f with the impulse-response sequence 
g; that is, h = T(f) = f * g. 
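
Equation (7.2) can be tried out directly on sequences having only finitely many nonzero entries. The following is a minimal sketch, not part of the original development; the dictionary representation (index n mapped to value) and the name `convolve` are choices made only for this illustration.

```python
def convolve(f, g):
    """Discrete convolution h_n = sum_m f_m g_{n-m} for finitely supported
    sequences stored as {index: value} dictionaries (missing entries are 0)."""
    h = {}
    for m, fm in f.items():
        for j, gj in g.items():
            # the product f_m * g_j contributes to h_n with n = m + j
            h[m + j] = h.get(m + j, 0) + fm * gj
    return h

delta = {0: 1}                       # the delta sequence
g = {-1: 1/3, 0: 1/3, 1: 1/3}        # a sample impulse-response sequence
f = {0: 3, 1: 0, 2: 6}               # a sample input sequence

print(convolve(delta, g) == g)       # the input delta returns g itself
print(convolve(f, g))
```

Feeding in the delta sequence returns g, in agreement with the definition g = T(δ).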


7.6 The Discrete Transfer Function 


Associated with each ω in the interval [0, 2π) we have the sequence 
$e_\omega = \{e^{-in\omega}\}_{n=-\infty}^{\infty}$; the minus sign in the exponent is just for notational 
convenience later. What happens when we let $f = e_\omega$ be the input to the 
system T? The output sequence h will be the convolution of the sequence 
$e_\omega$ with the sequence g; that is, 

$$ h_n = \sum_{m=-\infty}^{\infty} e^{-im\omega}g_{n-m} = \sum_{m=-\infty}^{\infty} g_m e^{-i(n-m)\omega} = e^{-in\omega}\sum_{m=-\infty}^{\infty} g_m e^{im\omega}. $$

Defining 

$$ G(\omega) = \sum_{m=-\infty}^{\infty} g_m e^{im\omega}, \tag{7.3} $$

for 0 ≤ ω < 2π, we can write 

$$ h_n = e^{-in\omega}G(\omega), $$

or 

$$ T(e_\omega) = G(\omega)e_\omega. $$

This tells us that when $e_\omega$ is the input, the output is a multiple of the 
input; the “frequency” ω has not changed, but the multiplication by G(ω) 
can alter the amplitude and phase of the complex-exponential sequence. 

Notice that Equation (7.3) is the definition of the Fourier series asso- 
ciated with the sequence g viewed as a sequence of Fourier coefficients. It 
follows that, once we have the function G(ω), we can recapture the original 
$g_n$ from the formula for Fourier coefficients: 

$$ g_n = \frac{1}{2\pi}\int_0^{2\pi} G(\omega)e^{-in\omega}\,d\omega. $$
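
The eigenfunction property just described can be checked numerically. The sketch below assumes a finitely supported impulse response, stored as a dictionary; the particular values of g and ω are hypothetical, chosen only for illustration.

```python
import cmath

def transfer(g, w):
    """G(w) = sum_m g_m e^{imw}, Equation (7.3), for finitely supported g."""
    return sum(gm * cmath.exp(1j * m * w) for m, gm in g.items())

def output_at(g, w, n):
    """h_n = sum_m g_m e^{-i(n-m)w}: the response at index n to the input e_w."""
    return sum(gm * cmath.exp(-1j * (n - m) * w) for m, gm in g.items())

g = {0: 1.0, 1: -0.5, 3: 0.25}       # hypothetical impulse response
w = 0.7                              # hypothetical frequency in [0, 2*pi)
G = transfer(g, w)

# compare h_n with G(w) e^{-inw} over a range of indices n
max_err = max(abs(output_at(g, w, n) - G * cmath.exp(-1j * n * w))
              for n in range(-5, 6))
print(max_err)                       # zero up to rounding error
```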


7.7 Using Fourier Series 


For any sequence $f = \{f_n\}$, we can define the function 

$$ F(\omega) = \sum_{n=-\infty}^{\infty} f_n e^{in\omega}, $$

for ω in the interval [0, 2π). Then each $f_n$ is a Fourier coefficient of F(ω) 
and we have 

$$ f_n = \frac{1}{2\pi}\int_0^{2\pi} F(\omega)e^{-in\omega}\,d\omega. $$

It follows that we can write 

$$ f = \frac{1}{2\pi}\int_0^{2\pi} F(\omega)e_\omega\,d\omega. \tag{7.4} $$

We interpret this as saying that the sequence f is a superposition of the 
individual sequences $e_\omega$, with coefficients F(ω). 


7.8 The Multiplication Theorem for Convolution 


Now consider f as the input to the system T, with h = T(f) as output. 
Using Equation (7.4), we can write 

$$ h = T(f) = T\Big(\frac{1}{2\pi}\int_0^{2\pi} F(\omega)e_\omega\,d\omega\Big) = \frac{1}{2\pi}\int_0^{2\pi} F(\omega)T(e_\omega)\,d\omega = \frac{1}{2\pi}\int_0^{2\pi} F(\omega)G(\omega)e_\omega\,d\omega. $$

But, applying Equation (7.4) to h, we have 

$$ h = \frac{1}{2\pi}\int_0^{2\pi} H(\omega)e_\omega\,d\omega. $$

It follows that H(ω) = F(ω)G(ω), which is analogous to what we found 
in the case of continuous systems. This tells us that the system T works 
by multiplying the function F(ω) associated with the input by the transfer 
function G(ω), to get the function H(ω) associated with the output h = 
T(f). In the next section we give an example. 


7.9 The Three-Point Moving Average 


We consider now the linear, shift-invariant system T that performs the 
three-point moving average operation on any input sequence. Let f be any 
input sequence. Then the output sequence is h with 

$$ h_n = \frac{1}{3}(f_{n-1} + f_n + f_{n+1}). $$

The impulse-response sequence is g with $g_{-1} = g_0 = g_1 = \frac{1}{3}$, and $g_n = 0$, 
otherwise. 

To illustrate, for the input sequence with $f_n = 1$ for all n, the output 
is $h_n = 1$ for all n. For the input sequence 

$$ f = \{..., 3, 0, 0, 3, 0, 0, ...\}, $$

the output h is again the sequence $h_n = 1$ for all n. If our input is 
the difference of the previous two input sequences, that is, the input is 
{..., 2, −1, −1, 2, −1, −1, ...}, then the output is the sequence with all en- 
tries equal to zero. 

The transfer function G(ω) is 

$$ G(\omega) = \frac{1}{3}(e^{i\omega} + 1 + e^{-i\omega}) = \frac{1}{3}(1 + 2\cos\omega). $$

The function G(ω) has a zero when cos ω = −1/2, or when ω = 2π/3 or ω = 4π/3. 
Notice that the sequence given by 

$$ f_n = e^{-i\frac{2\pi}{3}n} + e^{i\frac{2\pi}{3}n} = 2\cos\frac{2\pi}{3}n $$

is the sequence {..., 2, −1, −1, 2, −1, −1, ...}, which, as we have just seen, 
has as its output the zero sequence. We can say that the reason the output 
is zero is that the transfer function has a zero at ω = 2π/3 and at ω = 4π/3 
(equivalently, at ω = −2π/3, by 2π-periodicity). Those complex-exponential 
components of the input sequence that 
correspond to values of ω where G(ω) = 0 will be removed in the output. 
This is a useful role that filtering can play; we can null out an undesired 
complex-exponential component of an input signal by designing G(ω) to 
have a root at its frequency ω. 
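
The nulling effect can be seen numerically. The following sketch simply applies the three-point moving average to the sequence 2 cos(2πn/3) and evaluates the transfer function at its two zeros; the function names are chosen only for this illustration.

```python
import math

def moving_average(f, n):
    """h_n = (f_{n-1} + f_n + f_{n+1}) / 3, with the input given as a function of n."""
    return (f(n - 1) + f(n) + f(n + 1)) / 3

def G(w):
    """Transfer function of the three-point moving average."""
    return (1 + 2 * math.cos(w)) / 3

def f(n):
    """The sequence ..., 2, -1, -1, 2, -1, -1, ..."""
    return 2 * math.cos(2 * math.pi * n / 3)

outputs = [moving_average(f, n) for n in range(-3, 4)]
print(outputs)                                    # all zero up to rounding error
print(G(2 * math.pi / 3), G(4 * math.pi / 3))     # both zero up to rounding error
print(G(0.0))                                     # constants pass unchanged: 1.0
```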


7.10 Autocorrelation 
If we take the input to our convolution filter to be the sequence f related 
to the impulse-response sequence by 

$$ f_n = \overline{g_{-n}}, $$

then the output sequence is h with entries 

$$ h_n = \sum_{k=-\infty}^{+\infty} g_k\overline{g_{k-n}} $$

and $H(\omega) = |G(\omega)|^2$. The sequence h is called the autocorrelation sequence 
for g and $|G(\omega)|^2$ is the power spectrum of g. 
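
As a numerical illustration of the relation between the autocorrelation sequence and the power spectrum, the sketch below computes h for a short complex sequence and compares H(ω) with |G(ω)|² at one frequency. The sequence g, the frequency ω, and the function names are hypothetical choices for this example.

```python
import cmath

def autocorrelation(g):
    """h_n = sum_k g_k * conj(g_{k-n}) for finitely supported g = {k: g_k}."""
    h = {}
    for k, gk in g.items():
        for j, gj in g.items():
            n = k - j                     # here g_j plays the role of g_{k-n}
            h[n] = h.get(n, 0) + gk * gj.conjugate()
    return h

def series(c, w):
    """sum_n c_n e^{inw}, the function associated with the sequence c."""
    return sum(cn * cmath.exp(1j * n * w) for n, cn in c.items())

g = {0: 1 + 2j, 1: -0.5, 2: 0.25j}        # hypothetical sequence
h = autocorrelation(g)
w = 1.3                                   # hypothetical frequency

print(abs(series(h, w) - abs(series(g, w)) ** 2))   # zero up to rounding error
print(h[0])                               # the sum of |g_k|^2
```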

Autocorrelation sequences have special properties not shared with or- 
dinary sequences, as the exercise below shows. The Cauchy Inequality is 
valid for infinite sequences: with the length of g defined by 

$$ \|g\| = \Big(\sum_{n=-\infty}^{+\infty} |g_n|^2\Big)^{1/2} $$

and the inner product of any sequences f and g given by 

$$ \langle f, g\rangle = \sum_{n=-\infty}^{+\infty} f_n\overline{g_n}, $$

we have 

$$ |\langle f, g\rangle| \le \|f\|\,\|g\|, $$

with equality if and only if g is a constant multiple of f. 


Ex. 7.1 Let h be the autocorrelation sequence for g. Show that $h_{-n} = \overline{h_n}$ 
and $h_0 \ge |h_n|$ for all n. 
7.11 Stable Systems 


An infinite sequence $f = \{f_n\}$ is called bounded if there is a constant 
A > 0 such that $|f_n| \le A$, for all n. The shift-invariant linear system with 
impulse-response sequence g = T(δ) is said to be stable [120] if the output 
sequence $h = \{h_n\}$ is bounded whenever the input sequence $f = \{f_n\}$ is. In 
Exercise 7.2 below we ask the reader to prove that, in order for the system 
to be stable, it is both necessary and sufficient that 

$$ \sum_{n=-\infty}^{\infty} |g_n| < +\infty. $$

Given a doubly infinite sequence, $g = \{g_n\}_{n=-\infty}^{+\infty}$, we associate with g its 
z-transform, the function of the complex variable z given by 

$$ G(z) = \sum_{n=-\infty}^{+\infty} g_n z^{-n}. $$

Doubly infinite series of this form are called Laurent series and occur in 
the representation of functions analytic in an annulus. Note that if we take 
$z = e^{-i\omega}$ then G(z) becomes G(ω) as defined by Equation (7.3). The z- 
transform is a somewhat more flexible tool in that we are not restricted to 
those sequences g for which the z-transform is defined for $z = e^{-i\omega}$. 


Ex. 7.2 Show that the shift-invariant linear system with impulse-response 
sequence g is stable if and only if 

$$ \sum_{n=-\infty}^{+\infty} |g_n| < +\infty. $$

Hint: If, on the contrary, 

$$ \sum_{n=-\infty}^{+\infty} |g_n| = +\infty, $$

consider as input the bounded sequence f with 

$$ f_n = \overline{g_{-n}}/|g_{-n}| $$

(taking $f_n = 0$ when $g_{-n} = 0$) and show that the output $h_0 = +\infty$. 


Ex. 7.3 Consider the linear system determined by the sequence $g_0 = 2$, 
$g_n = (\frac{1}{2})^{|n|}$, for n ≠ 0. Show that this system is stable. Calculate the z- 
transform of $\{g_n\}$ and determine its region of convergence. 
7.12 Causal Filters 


The shift-invariant linear system with impulse-response sequence g is 
said to be a causal system if the sequence $\{g_n\}$ is itself causal; that is, 
$g_n = 0$ for n < 0. For causal systems the value of the output at n, that 
is, $h_n$, depends only on those input values $f_m$ for m ≤ n. When the input 
is a time series, this says that the value of the output at any given time 
depends only on the value of the inputs up to that time, and not on future 
values of the input sequence. A number of important filters, such as band- 
limiting filters, are not causal and have to be approximated by causal filters 
to operate in real time. 


Ex. 7.4 Show that the function $G(z) = (z - z_0)^{-1}$ is the z-transform of a 
causal sequence g, where $z_0$ is a fixed complex number. What is the region 
of convergence? Show that the resulting linear system is stable if and only 
8.1 Chapter Summary 


Convolution is an important concept in signal processing and occurs 
in several distinct contexts. The simplest example of convolution is the 
nonperiodic convolution of finite vectors, which is what we do to the co- 
efficients when we multiply two polynomials together. In Chapters 6 and 
7 we considered the convolution of functions of a continuous variable and 
of infinite sequences. The reader may also recall an earlier encounter with 
convolution in a course on differential equations. In this chapter we shall 
discuss nonperiodic convolution and periodic convolution of vectors, with 
particular emphasis on the role of the vector DFT and the FFT algorithm. 
8.2 Nonperiodic Convolution 


Recall the algebra problem of multiplying one polynomial by another. 
Suppose 

$$ A(x) = a_0 + a_1x + ... + a_Mx^M $$

and 

$$ B(x) = b_0 + b_1x + ... + b_Nx^N. $$

Let C(x) = A(x)B(x). With 

$$ C(x) = c_0 + c_1x + ... + c_{M+N}x^{M+N}, $$

each of the coefficients $c_j$, j = 0, ..., M + N, can be expressed in terms of the 
$a_m$ and $b_n$ (an easy exercise!). The vector $c = (c_0, ..., c_{M+N})$ is called the 
nonperiodic convolution of the vectors $a = (a_0, ..., a_M)$ and $b = (b_0, ..., b_N)$. 
Nonperiodic convolution can be viewed as a particular case of periodic 
convolution, as we shall see. 
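
The nonperiodic convolution can be computed by collecting, for each j, the products of coefficients whose indices sum to j. The following is a minimal sketch; the function name and the sample polynomials are chosen only for this illustration.

```python
def nonperiodic_convolution(a, b):
    """Coefficients of C(x) = A(x)B(x): c_j is the sum of a_m * b_n over m + n = j."""
    c = [0] * (len(a) + len(b) - 1)
    for m, am in enumerate(a):
        for n, bn in enumerate(b):
            c[m + n] += am * bn
    return c

# (1 + 2x)(3 + x + x^2) = 3 + 7x + 3x^2 + 2x^3
print(nonperiodic_convolution([1, 2], [3, 1, 1]))   # [3, 7, 3, 2]
```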


8.3 The DFT as a Polynomial 


Given the complex numbers $f_0, f_1, ..., f_{N-1}$, we form the vector $f = 
(f_0, f_1, ..., f_{N-1})^T$. The DFT of the vector f is the function 

$$ DFT_f(\omega) = \sum_{n=0}^{N-1} f_n e^{in\omega}, $$

defined for ω in the interval [0, 2π). Because $e^{in\omega} = (e^{i\omega})^n$, we can write 
the DFT as a polynomial 

$$ DFT_f(\omega) = \sum_{n=0}^{N-1} f_n (e^{i\omega})^n. $$

If we have a second vector, say $d = (d_0, d_1, ..., d_{N-1})^T$, then we define 
$DFT_d(\omega)$ similarly. When we multiply $DFT_f(\omega)$ by $DFT_d(\omega)$, we are mul- 
tiplying two polynomials together, so the result is a sum of powers of the 
form 

$$ c_0 + c_1e^{i\omega} + c_2(e^{i\omega})^2 + ... + c_{2N-2}(e^{i\omega})^{2N-2}, \tag{8.1} $$

for 

$$ c_j = f_0d_j + f_1d_{j-1} + ... + f_jd_0. $$

This is nonperiodic convolution again. In the next section, we consider what 
happens when, instead of using arbitrary values of ω, we consider only the 
N special values $\omega_k = \frac{2\pi}{N}k$, k = 0, 1, ..., N − 1. Because of the periodicity 
of the complex exponential function, we have 

$$ (e^{i\omega_k})^{N+j} = (e^{i\omega_k})^j, $$

for each k. As a result, all the powers higher than N − 1 that showed up in 
the previous multiplication in Equation (8.1) now become equal to lower 
powers, and the product now only has N terms, instead of the 2N − 1 terms 
we got previously. When we calculate the coefficients of these powers, we 
find that we get more than we got when we did the nonperiodic convolution. 
Now what we get is called periodic convolution. 


8.4 The Vector DFT and Periodic Convolution 


As we just discussed, nonperiodic convolution is another way of look- 
ing at the multiplication of two polynomials. This relationship between 
convolution on the one hand and multiplication on the other is a funda- 
mental aspect of convolution. Whenever we have a convolution we should 
ask what related mathematical objects are being multiplied. We ask this 
question now with regard to periodic convolution; the answer turns out to 
be the vector discrete Fourier transform (vDFT). 


8.4.1 The Vector DFT 


Let $f = (f_0, f_1, ..., f_{N-1})^T$ be a column vector whose entries are N 
arbitrary complex numbers. For k = 0, 1, ..., N − 1, we let 

$$ F_k = \sum_{n=0}^{N-1} f_n e^{2\pi ikn/N} = DFT_f(\omega_k). \tag{8.2} $$

Then we let $F = (F_0, F_1, ..., F_{N-1})^T$ be the column vector with the N com- 
plex entries $F_k$. The vector F is called the vector discrete Fourier transform 
of the vector f, and we denote it by $F = vDFT_f$. 

The entries of the vector $F = vDFT_f$ are N equi-spaced values of the 
function $DFT_f(\omega)$. If the Fourier transform F(ω) is zero for ω outside the 
interval [0, 2π], and $f_n = f(n)$, for n = 0, 1, ..., N − 1, then the entries of 
the vector F are N estimated values of F(ω). 
Ex. 8.1 Let $f_n$ be real, for each n. Show that $F_{N-k} = \overline{F_k}$, for each k. 


As we can see from Equation (8.2), there are N multiplications involved 
in the calculation of each $F_k$, and there are N values of k, so it would seem 
that, in order to calculate the vector DFT of f, we need $N^2$ multiplications. 
In many applications, N is quite large and calculating the vector F using 
the definition would be unrealistically time-consuming. The fast Fourier 
transform algorithm (FFT), to be discussed later, gives a quick way to 
calculate the vector F from the vector f. The FFT, usually credited to 
Cooley and Tukey, was discovered in the mid-1960’s and revolutionized 
signal and image processing. 


8.4.2 Periodic Convolution 


Given the N by 1 vectors f and d with complex entries $f_n$ and $d_n$, 
respectively, we define a third N by 1 vector f * d, the periodic convolution 
of f and d, to have the entries 

$$ (f * d)_n = f_0d_n + f_1d_{n-1} + ... + f_nd_0 + f_{n+1}d_{N-1} + ... + f_{N-1}d_{n+1}, \tag{8.3} $$

for n = 0, 1, ..., N − 1. 

Notice that the term on the right side of Equation (8.3) is the sum of 
all products of entries, one from f and one from d, where the sum of their 
respective indices is either n or n+ N. Periodic convolution is illustrated in 
Figure 8.1. The first exercise relates the periodic convolution to the vector 
DFT. 

In the exercises that follow we investigate properties of the vector DFT 
and relate it to periodic convolution. It is not an exaggeration to say that 
these two exercises are the most important ones in signal processing. The 
first exercise establishes for finite vectors and periodic convolution a version 
of the multiplication theorems we saw earlier for continuous and discrete 
convolution. 


Ex. 8.2 Let $F = vDFT_f$ and $D = vDFT_d$. Define a third vector E having 
for its kth entry $E_k = F_kD_k$, for k = 0, ..., N − 1. Show that E is the vDFT 
of the vector f * d. 
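
Before attempting the proof, it may help to see the statement of Exercise 8.2 confirmed numerically on a small example. The sketch below computes the vector DFT directly from Equation (8.2) and the periodic convolution from Equation (8.3), written with indices reduced modulo N; the sample vectors and function names are hypothetical. A numerical check of this kind is, of course, not a proof.

```python
import cmath

def vdft(f):
    """F_k = sum_n f_n e^{2 pi i k n / N}, computed directly from Equation (8.2)."""
    N = len(f)
    return [sum(f[n] * cmath.exp(2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def periodic_convolution(f, d):
    """(f * d)_n = sum_m f_m d_{(n - m) mod N}, which is Equation (8.3)."""
    N = len(f)
    return [sum(f[m] * d[(n - m) % N] for m in range(N)) for n in range(N)]

f = [1, 2, 0, -1]
d = [3, 0, 1, 2]
print(periodic_convolution(f, d))          # [7, 5, -1, 1]

lhs = vdft(periodic_convolution(f, d))     # vDFT of f * d
rhs = [Fk * Dk for Fk, Dk in zip(vdft(f), vdft(d))]   # entries F_k D_k
max_err = max(abs(x - y) for x, y in zip(lhs, rhs))
print(max_err)                             # zero up to rounding error
```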


The vector $vDFT_f$ can be obtained from the vector f by means of 
matrix multiplication by a certain matrix G, called the DFT matrix. The 
matrix G has an inverse that is easily computed and can be used to go 
from $F = vDFT_f$ back to the original f. The details are in Exercise 8.3. 


FIGURE 8.1: Periodic convolution of vectors a = (a(0), a(1), a(2), a(3)) 
and b = (b(0), b(1), b(2), b(3)). The figure shows the multiply-and-add step, 
(a * b)(0) = a(0)b(0) + a(1)b(3) + a(2)b(2) + a(3)b(1), followed by a clockwise 
rotation of the inner disk, which gives 
(a * b)(1) = a(0)b(1) + a(1)b(0) + a(2)b(3) + a(3)b(2). 

Ex. 8.3 Let G be the N by N matrix whose entries are 

$$ G_{jk} = e^{i(j-1)(k-1)2\pi/N}. $$

The matrix G is sometimes called the DFT matrix. Show that the inverse 
of G is $G^{-1} = \frac{1}{N}G^\dagger$, where $G^\dagger$ is the conjugate transpose of the matrix G. 
Then $f * d = G^{-1}E = \frac{1}{N}G^\dagger E$. 


Every time I have taught this subject I have told my students that, if 
they learn nothing else in the course, they should understand the previous 
two exercises, which are fundamental in signal processing. 
8.5 The vDFT of Sampled Data 


For a doubly infinite sequence $\{f_n \mid -\infty < n < \infty\}$, the function 
F(γ) given by the infinite series 

$$ F(\gamma) = \sum_{n=-\infty}^{\infty} f_n e^{in\gamma} \tag{8.4} $$

is sometimes called the discrete-time Fourier transform (DTFT) of the 
sequence, and the $f_n$ are called its Fourier coefficients. The function F(γ) 
is 2π-periodic, so we restrict our attention to the interval 0 ≤ γ ≤ 2π. If 
we start with a function F(γ), for 0 ≤ γ ≤ 2π, we can find the Fourier 
coefficients by 

$$ f_n = \frac{1}{2\pi}\int_0^{2\pi} F(\gamma)e^{-in\gamma}\,d\gamma. \tag{8.5} $$

8.5.1 Superposition of Sinusoids 


Equation (8.5) suggests a model for a function of a continuous variable 
x: 

$$ f(x) = \frac{1}{2\pi}\int_0^{2\pi} F(\gamma)e^{-ix\gamma}\,d\gamma. $$

The values $f_n$ then can be viewed as $f_n = f(n)$, that is, the $f_n$ are sampled 
values of the function f(x), sampled at the points x = n. The function 
F(γ) is now said to be the spectrum of the function f(x). The function 
f(x) is then viewed as a superposition of infinitely many simple functions, 
namely the complex exponentials or sinusoidal functions $e^{-ix\gamma}$, for values 
of γ that lie in the interval [0, 2π]. The relative contribution of each $e^{-ix\gamma}$ 
to f(x) is given by the complex number $\frac{1}{2\pi}F(\gamma)$. 


8.5.2 Rescaling 


In the model just discussed, we sampled the function f(x) at the points 
x = n. In applications, the variable x can have many meanings. In partic- 
ular, x is often time, denoted by the variable t. Then the variable γ will 
be related to frequency. Depending on the application, the frequencies in- 
volved in the function f(t) may be quite large numbers, or quite small ones; 
there is no reason to assume that they will all be in the interval [0, 2π]. For 
this reason, we have to modify our formulas. 

Suppose that the function g(t) is known to involve only frequencies in 
the interval $[0, \frac{2\pi}{\Delta}]$. Define f(x) = g(xΔ), so that 

$$ g(t) = f(t/\Delta) = \frac{1}{2\pi}\int_0^{2\pi} F(\gamma)e^{-i\gamma t/\Delta}\,d\gamma. $$

Introducing the variable ω = γ/Δ, and writing G(ω) = ΔF(ωΔ), we get 

$$ g(t) = \frac{1}{2\pi}\int_0^{2\pi/\Delta} G(\omega)e^{-i\omega t}\,d\omega. $$

Now the typical problem is to estimate G(ω) from measurements of g(t). 
Note that, using Equation (8.4), the function G(ω) can be written as fol- 
lows: 

$$ G(\omega) = \Delta F(\omega\Delta) = \Delta\sum_{n=-\infty}^{\infty} f_n e^{i(n\Delta)\omega}, $$

so that 

$$ G(\omega) = \Delta\sum_{n=-\infty}^{\infty} g(n\Delta)e^{i(n\Delta)\omega}. \tag{8.6} $$

Note that this is what Shannon’s Sampling Theorem tells us, and shows 
that the functions G(ω) and g(t) can be completely recovered from the 
infinite sequence of samples {g(nΔ)}, whenever G(ω) is zero outside an 
interval of total length $\frac{2\pi}{\Delta}$. 


8.5.3 The Aliasing Problem 


In the previous subsection, we assumed that we knew that the only 
frequencies involved in g(t) were in the interval $[0, \frac{2\pi}{\Delta}]$, and that Δ was our 
sampling spacing. Notice that, given our data g(nΔ), it is impossible for 
us to distinguish a frequency ω from $\omega + \frac{2\pi}{\Delta}k$, for any integer k: for any 
integers k and n we have 

$$ e^{i(\omega + \frac{2\pi}{\Delta}k)n\Delta} = e^{i\omega n\Delta}e^{2\pi ikn} = e^{i\omega n\Delta}. $$
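
A short numerical check makes the aliasing identity concrete. In the sketch below the sampling spacing, the frequency ω, and the integer k are hypothetical values chosen for illustration; the samples of the two complex exponentials agree up to rounding error.

```python
import cmath

delta_t = 0.25                        # hypothetical sampling spacing (Delta)
w = 3.0                               # a frequency in [0, 2*pi/Delta)
k = 2                                 # any integer
w_alias = w + 2 * cmath.pi * k / delta_t

# samples at t = n * Delta of the exponentials with frequencies w and w_alias
samples = [cmath.exp(1j * w * n * delta_t) for n in range(8)]
samples_alias = [cmath.exp(1j * w_alias * n * delta_t) for n in range(8)]

max_diff = max(abs(a - b) for a, b in zip(samples, samples_alias))
print(max_diff)                       # zero up to rounding error
```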


8.5.4 The Discrete Fourier Transform 


In practice, we will have only finitely many measurements g(nΔ); even 
these will typically be noisy, but we shall overlook this for now. Suppose 
our data is g(nΔ), for n = 0, 1, ..., N − 1. For notational simplicity, we let 
$f_n = g(n\Delta)$. It seems reasonable, in this case, to base our estimate $\hat{G}(\omega)$ 
of G(ω) on Equation (8.6) and write 

$$ \hat{G}(\omega) = \Delta\sum_{n=0}^{N-1} g(n\Delta)e^{i(n\Delta)\omega}. \tag{8.7} $$

We shall call $\hat{G}(\omega)$ the DFT estimate of the function G(ω) and write 

$$ G_{DFT}(\omega) = \hat{G}(\omega); $$

it will be clear from the context that the DFT uses samples of g(t) and 
estimates G(ω). 


8.5.5 Calculating Values of the DFT 


Suppose that we want to evaluate this estimate of G(ω) at the N 
points $\omega_k = \frac{2\pi k}{N\Delta}$, for k = 0, 1, ..., N − 1. Then we have 

$$ \hat{G}(\omega_k) = \Delta\sum_{n=0}^{N-1} g(n\Delta)e^{i(n\Delta)\frac{2\pi k}{N\Delta}} = \sum_{n=0}^{N-1} \Delta g(n\Delta)e^{2\pi ikn/N}. $$

Notice that this is the vector DFT entry $F_k$ for the choices $f_n = \Delta g(n\Delta)$. 

To summarize, given the samples g(nΔ), for n = 0, 1, ..., N − 1, we 
can get the N values $\hat{G}(\frac{2\pi k}{N\Delta})$ by taking the vector DFT of the vector 
$f = (\Delta g(0), \Delta g(\Delta), ..., \Delta g((N-1)\Delta))^T$. We would normally use the FFT 
algorithm to perform these calculations. 


8.5.6 Zero-Padding 


Suppose we simply want to graph the DFT estimate $G_{DFT}(\omega) = \hat{G}(\omega)$ 
on some uniform grid in the interval $[0, \frac{2\pi}{\Delta}]$, but want to use more than N 
points in the grid. The FFT algorithm always gives us back a vector with 
the same number of entries as the one we begin with, so if we want to get, 
say, M > N points in the grid, we need to give the FFT algorithm a vector 
with M entries. We do this by zero-padding, that is, by taking as our input 
to the FFT algorithm the M by 1 column vector 

$$ f = (\Delta g(0), \Delta g(\Delta), ..., \Delta g((N-1)\Delta), 0, 0, ..., 0)^T. $$

The resulting vector DFT F then has the entries 

$$ F_k = \Delta\sum_{n=0}^{N-1} g(n\Delta)e^{2\pi ikn/M}, $$

for k = 0, 1, ..., M − 1; therefore, we have $F_k = \hat{G}(2\pi k/(M\Delta))$. 
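
The effect of zero-padding can be seen in a small computation. The sketch below takes the data vector to be already scaled by Δ and works with the function $DFT_f(\omega) = \sum_n f_n e^{in\omega}$ on [0, 2π), which corresponds to $\hat{G}(\omega/\Delta)$; the data values and function names are hypothetical. The vector DFT is computed directly from its definition rather than by the FFT.

```python
import cmath

def vdft(f):
    """F_k = sum_n f_n e^{2 pi i k n / N}, computed directly from the definition."""
    N = len(f)
    return [sum(f[n] * cmath.exp(2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dft_function(f, w):
    """DFT_f(w) = sum_n f_n e^{inw}, a function of the continuous variable w."""
    return sum(fn * cmath.exp(1j * n * w) for n, fn in enumerate(f))

f = [1.0, -2.0, 0.5, 3.0]              # N = 4 hypothetical data values
M = 8
padded = f + [0.0] * (M - len(f))      # zero-pad to length M
F_padded = vdft(padded)

# the padded vector DFT samples the same function DFT_f on a finer grid
max_err = max(abs(F_padded[k] - dft_function(f, 2 * cmath.pi * k / M))
              for k in range(M))

# every second entry reproduces the unpadded vector DFT
F = vdft(f)
coarse_err = max(abs(F_padded[2 * k] - F[k]) for k in range(len(f)))
print(max_err, coarse_err)             # both zero up to rounding error
```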


8.5.7 What the vDFT Achieves 


It is important to note that the values $F_k$ we calculate by applying the 
FFT algorithm to the sampled data g(nΔ) are not values of the function 
G(ω), but of the estimate, $\hat{G}(\omega)$. Zero-padding allows us to use the FFT to 
see more of the values of $\hat{G}(\omega)$. It does not improve resolution, but simply 
shows us what is already present in the function $\hat{G}(\omega)$, which we may not 
have seen without the zero-padding. The FFT algorithm is most efficient 
when N is a power of two, so it is common practice to zero-pad f using as 
M the smallest power of two not less than N. 


8.5.8 Terminology 


In the signal processing literature no special name is given to what we 
call here $G_{DFT}(\omega)$, and the vector DFT of the data vector is called the DFT 
of the data. This is unfortunate, because the function of the continuous 
variable given in Equation (8.7) is the more fundamental entity, the vector 
DFT being merely the evaluation of that function at N equi-spaced points. 
If we should wish to evaluate $G_{DFT}(\omega)$ at M > N equi-spaced points, 
say, for example, for the purpose of graphing the function, we would zero- 
pad the data vector, as we just discussed. The resulting vector DFT is not 
the same vector as the one obtained prior to zero-padding; it is not even 
the same size. But both of these vectors have, as their entries, values of the 
same function, $G_{DFT}(\omega)$. 


8.6 Understanding the Vector DFT 


Let g(t) be the signal we are interested in. We sample the signal at 
the points t = nA, for n = 0,1,...,N — 1, to get our data values, which 
we label fn = g(nA). To illustrate the significance of the vector DFT, we 
consider the simplest case, in which the signal g(t) we are sampling is a 
single sinusoid. 

Suppose that g(t) is a complex exponential function with frequency the 
negative of $\omega_m = 2\pi m/N\Delta$; the reason for the negative is a technical one 
that we can safely ignore at this stage. Then 

$$ g(t) = e^{-i(2\pi m/N\Delta)t}, $$

for some nonnegative integer 0 ≤ m ≤ N − 1. Our data is then 

$$ f_n = \Delta g(n\Delta) = \Delta e^{-i(2\pi m/N\Delta)n\Delta} = \Delta e^{-2\pi imn/N}. $$

Now we calculate the components $F_k$ of the vector DFT. We have 

$$ F_k = \sum_{n=0}^{N-1} f_n e^{2\pi ikn/N} = \Delta\sum_{n=0}^{N-1} e^{2\pi i(k-m)n/N}. $$

If k = m, then $F_m = N\Delta$, while, according to Equation (4.5), $F_k = 0$, for k 
not equal to m. Let’s try this on a more complicated signal. 
Suppose now that our signal has the form 

$$ f(t) = \sum_{m=0}^{N-1} A_m e^{-i(2\pi m/N\Delta)t}. \tag{8.8} $$

The data vector is now 

$$ f_n = \Delta\sum_{m=0}^{N-1} A_m e^{-2\pi imn/N}. $$

The entry $F_m$ of the vector DFT is now the sum of the values it would have 
if the signal had consisted only of one of the single sinusoids 
$A_k e^{-i(2\pi k/N\Delta)t}$, k = 0, 1, ..., N − 1. As we 
just saw, all but one of these values would be zero, and so $F_m = N\Delta A_m$, 
and this holds for each m = 0, 1, ..., N − 1. 

Summarizing, when the signal f(t) is a sum of N sinusoids, with the 
frequencies $\omega_k = 2\pi k/N\Delta$, for k = 0, 1, ..., N − 1, and we sample at t = nΔ, 
for n = 0, 1, ..., N − 1, the entries $F_k$ of the vector DFT are precisely NΔ 
times the corresponding amplitudes $A_k$. For this particular situation, cal- 
culating the vector DFT gives us the amplitudes of the different sinusoidal 
components of f(t). We must remember, however, that this applies only 
to the case in which f(t) has the form in Equation (8.8). In general, the 
entries of the vector DFT are to be understood as approximations, in the 
sense discussed above. 
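
The recovery of the amplitudes can be checked on a small example. In the sketch below the values of N, Δ, and the amplitudes $A_m$ are hypothetical; the signal has the form in Equation (8.8), the data are $f_n = \Delta f(n\Delta)$, and the vector DFT is computed directly from Equation (8.2).

```python
import cmath

def vdft(f):
    """F_k = sum_n f_n e^{2 pi i k n / N}, computed directly from Equation (8.2)."""
    N = len(f)
    return [sum(f[n] * cmath.exp(2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

N, delta_t = 8, 0.5                          # hypothetical N and spacing Delta
A = [0, 2, 0, 0, 1 - 1j, 0, 0, 0.5]          # hypothetical amplitudes A_m

def signal(t):
    """Sum of N sinusoids with frequencies 2 pi m / (N Delta), as in (8.8)."""
    return sum(A[m] * cmath.exp(-1j * (2 * cmath.pi * m / (N * delta_t)) * t)
               for m in range(N))

data = [delta_t * signal(n * delta_t) for n in range(N)]
F = vdft(data)
recovered = [Fk / (N * delta_t) for Fk in F]  # F_m = N * Delta * A_m

max_err = max(abs(r - a) for r, a in zip(recovered, A))
print(max_err)                                # zero up to rounding error
```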

As mentioned previously, nonperiodic convolution is really a special case 
of periodic convolution. Extend the M + 1 by 1 vector a to an M + N + 1 
by 1 vector by appending N zero entries; similarly, extend the N + 1 by 1 
vector b to an M + N + 1 by 1 vector by appending M zeros. The vector c is now the 
periodic convolution of these extended vectors. Therefore, since we have 
an efficient algorithm for performing periodic convolution, namely the Fast 
Fourier Transform algorithm (FFT), we have a fast way to do the periodic 
(and thereby nonperiodic) convolution and polynomial multiplication. 
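
The zero-extension just described is easy to try out. In the sketch below the periodic convolution is computed directly from its definition, not with the FFT, so the example shows only the reduction of nonperiodic to periodic convolution and not the speed-up; the function names are chosen for this illustration.

```python
def periodic_convolution(f, d):
    """(f * d)_n = sum_m f_m d_{(n - m) mod N} for vectors of equal length N."""
    N = len(f)
    return [sum(f[m] * d[(n - m) % N] for m in range(N)) for n in range(N)]

def nonperiodic_via_periodic(a, b):
    """Zero-extend a and b to the common length len(a) + len(b) - 1, then
    take the periodic convolution of the extended vectors."""
    L = len(a) + len(b) - 1
    a_ext = list(a) + [0] * (L - len(a))
    b_ext = list(b) + [0] * (L - len(b))
    return periodic_convolution(a_ext, b_ext)

# (1 + 2x)(3 + x + x^2) = 3 + 7x + 3x^2 + 2x^3
print(nonperiodic_via_periodic([1, 2], [3, 1, 1]))   # [3, 7, 3, 2]
```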


8.7 The Fast Fourier Transform (FFT) 


A fundamental problem in signal processing is to estimate finitely many 
values of the function F(ω) from finitely many values of its (inverse) Fourier 
transform, f(t). As we have seen, the DFT arises in several ways in that 
estimation effort. The Fast Fourier transform (FFT), discovered in 1965 by 
Cooley and Tukey, is an important and efficient algorithm for calculating 
the vector DFT [58]. John Tukey has been quoted as saying that his main 
contribution to this discovery was the firm and often voiced belief that such 
an algorithm must exist. 


8.7.1 Evaluating a Polynomial 


To illustrate the main idea underlying the FFT, consider the problem of 
evaluating a real polynomial P(x) at a point, say x = c. Let the polynomial 
be 

$$ P(x) = a_0 + a_1x + a_2x^2 + ... + a_{2K}x^{2K}, $$

where $a_{2K}$ might be zero. Performing the evaluation efficiently by Horner’s 
method, 

$$ P(c) = (((a_{2K}c + a_{2K-1})c + a_{2K-2})c + a_{2K-3})c + ..., $$

requires 2K multiplications, so the complexity is on the order of the degree 
of the polynomial being evaluated. But suppose we also want P(−c). We 
can write 

$$ P(x) = (a_0 + a_2x^2 + ... + a_{2K}x^{2K}) + x(a_1 + a_3x^2 + ... + a_{2K-1}x^{2K-2}) $$

or 

$$ P(x) = Q(x^2) + xR(x^2). $$

Therefore, we have $P(c) = Q(c^2) + cR(c^2)$ and $P(-c) = Q(c^2) - cR(c^2)$. 
If we evaluate P(c) by evaluating $Q(c^2)$ and $R(c^2)$ separately, one more 
multiplication gives us P(−c) as well. The FFT is based on repeated use of 
this idea, which turns out to be more powerful when we are using complex 
exponentials, because of their periodicity. 
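
The even-odd splitting can be written out in a few lines. The following sketch evaluates Q and R by Horner's method and then combines them to obtain both P(c) and P(−c); the function names and the sample polynomial are chosen only for this illustration.

```python
def horner(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k by Horner's method."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def eval_pair(coeffs, c):
    """Return (P(c), P(-c)) using P(x) = Q(x^2) + x R(x^2)."""
    q = horner(coeffs[0::2], c * c)    # Q uses the even-indexed coefficients
    r = horner(coeffs[1::2], c * c)    # R uses the odd-indexed coefficients
    return q + c * r, q - c * r

coeffs = [1, -3, 0, 2, 5]              # P(x) = 1 - 3x + 2x^3 + 5x^4
print(eval_pair(coeffs, 2))            # (91, 71)
```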


8.7.2 The DFT and Vector DFT 


Suppose that the data are the samples {f(nΔ), n = 1, ..., N}, where 
Δ > 0 is the sampling increment or sampling spacing. The DFT estimate 
of F(ω) is the function $F_{DFT}(\omega)$, defined for ω in [−π/Δ, π/Δ], and given 
by 

$$ F_{DFT}(\omega) = \Delta\sum_{n=1}^{N} f(n\Delta)e^{i(n\Delta)\omega}. $$

The DFT estimate $F_{DFT}(\omega)$ is data consistent; its inverse Fourier- 
transform value at t = nΔ is f(nΔ) for n = 1, ..., N. The DFT is sometimes 
used in a slightly more general context in which the coefficients are not nec- 
essarily viewed as samples of a function f(t). 

Given the complex N-dimensional column vector $f = (f_0, f_1, ..., f_{N-1})^T$, 
define the DFT of vector f to be the function $DFT_f(\omega)$, defined for ω in 
[0, 2π), given by 

$$ DFT_f(\omega) = \sum_{n=0}^{N-1} f_n e^{in\omega}. $$

Let F be the complex N-dimensional vector $F = (F_0, F_1, ..., F_{N-1})^T$, where 
$F_k = DFT_f(2\pi k/N)$, k = 0, 1, ..., N − 1. So the vector F consists of N values 
of the function $DFT_f$, taken at N equi-spaced points 2π/N apart in [0, 2π). 
From the formula for $DFT_f$ we have, for k = 0, 1, ..., N − 1, 

$$ F_k = DFT_f(2\pi k/N) = \sum_{n=0}^{N-1} f_n e^{2\pi ink/N}. \tag{8.9} $$

To calculate a single $F_k$ requires N multiplications; it would seem that to 
calculate all N of them would require $N^2$ multiplications. However, using 
the FFT algorithm, we can calculate vector F in approximately $N\log_2(N)$ 
multiplications. 


8.7.3 Exploiting Redundancy 
Suppose that N = 2M is even. We can rewrite Equation (8.9) as follows: 

$$ F_k = \sum_{m=0}^{M-1} f_{2m}e^{2\pi i(2m)k/N} + \sum_{m=0}^{M-1} f_{2m+1}e^{2\pi i(2m+1)k/N}, $$

or, equivalently, 

$$ F_k = \sum_{m=0}^{M-1} f_{2m}e^{2\pi imk/M} + e^{2\pi ik/N}\sum_{m=0}^{M-1} f_{2m+1}e^{2\pi imk/M}. \tag{8.10} $$

Note that if 0 ≤ k ≤ M − 1 then 

$$ F_{k+M} = \sum_{m=0}^{M-1} f_{2m}e^{2\pi imk/M} - e^{2\pi ik/N}\sum_{m=0}^{M-1} f_{2m+1}e^{2\pi imk/M}, \tag{8.11} $$

so there is no additional computational cost in calculating the second half 
of the entries of F, once we have calculated the first half. The FFT is the 
algorithm that results when we take full advantage of the savings obtainable 
by splitting a DFT calculation into two similar calculations, each half the 
size. 

We assume now that $N = 2^L$. Notice that if we use Equations (8.10) 
and (8.11) to calculate vector F, the problem reduces to the calculation of 
two similar DFT evaluations, both involving half as many entries, followed 
by one multiplication for each of the k between 0 and M − 1. We can split 
these in half as well. The FFT algorithm involves repeated splitting of the 
calculations of DFTs at each step into two similar DFTs, but with half the 
number of entries, followed by as many multiplications as there are entries 
in either one of these smaller DFTs. We use recursion to calculate the cost 
C(N) of computing F using this FFT method. From Equation (8.10) we 
see that C(N) = 2C(N/2) + (N/2). Applying the same reasoning to get 
C(N/2) = 2C(N/4) + (N/4), we obtain 

$$ C(N) = 2C(N/2) + (N/2) = 4C(N/4) + 2(N/2) = ... = 2^L C(N/2^L) + L(N/2) = N + L(N/2). $$

Therefore, the cost required to calculate F is approximately $N\log_2 N$. 
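
The recursion in Equations (8.10) and (8.11) translates almost line by line into code. The sketch below is one possible recursive implementation for N a power of two, not a production FFT; it also counts the multiplications by the factors $e^{2\pi ik/N}$. Because this sketch assigns cost zero to a DFT of length one, its count is L(N/2), which differs from the N + L(N/2) above only by the term N coming from C(1) = 1.

```python
import cmath

def fft(f):
    """Return (F, count), where F_k = sum_n f_n e^{2 pi i k n / N} for N a
    power of two and count is the number of multiplications by e^{2 pi i k/N}."""
    N = len(f)
    if N == 1:
        return list(f), 0
    M = N // 2
    even, c_even = fft(f[0::2])        # DFT of the entries f_{2m}
    odd, c_odd = fft(f[1::2])          # DFT of the entries f_{2m+1}
    F = [0] * N
    for k in range(M):
        t = cmath.exp(2j * cmath.pi * k / N) * odd[k]   # one multiplication per k
        F[k] = even[k] + t             # Equation (8.10)
        F[k + M] = even[k] - t         # Equation (8.11)
    return F, c_even + c_odd + M

def vdft(f):
    """Direct evaluation of Equation (8.9), used here only for comparison."""
    N = len(f)
    return [sum(f[n] * cmath.exp(2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

f = [1, 2, 0, -1, 3, 0.5, -2, 4]       # hypothetical data, N = 8 = 2^3
F, count = fft(f)
max_err = max(abs(x - y) for x, y in zip(F, vdft(f)))
print(count)                           # 12, which is L * N / 2 with L = 3
print(max_err)                         # zero up to rounding error
```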

From our earlier discussion of discrete linear filters and convolution, we 
see that the FFT can be used to calculate the periodic convolution (or even 
the nonperiodic convolution) of finite length vectors. 

Finally, let’s return to the original context of estimating the Fourier 
transform F(ω) of function f(t) from finitely many samples of f(t). If we 
have N equi-spaced samples, we can use them to form the vector f and 
perform the FFT algorithm to get vector F consisting of N values of the 
DFT estimate of F(ω). It may happen that we wish to calculate more 
than N values of the DFT estimate, perhaps to produce a smooth looking 
graph. We can still use the FFT, but we must trick it into thinking we have 
more data than the N samples we really have. We do this by zero-padding. 
Instead of creating the N-dimensional vector f, we make a longer vector by 
appending, say, J zeros to the data, to make a vector that has dimension 
N + J. The DFT estimate is still the same function of ω, since we have 
only included new zero coefficients as fake data; but, the FFT thinks we 
have N + J data values, so it returns N + J values of the DFT, at N + J 
equi-spaced values of ω in [0, 2π). 


8.7.4 The Two-Dimensional Case 


Suppose now that we have the data $\{f(m\Delta_x, n\Delta_y)\}$, for m = 1, ..., M 
and n = 1, ..., N, where $\Delta_x > 0$ and $\Delta_y > 0$ are the sample spacings in 
the x and y directions, respectively. The DFT of this data is the function 
$F_{DFT}(\alpha, \beta)$ defined by 

$$F_{DFT}(\alpha, \beta) = \Delta_x \Delta_y \sum_{m=1}^{M} \sum_{n=1}^{N} f(m\Delta_x, n\Delta_y)\, e^{i(\alpha m \Delta_x + \beta n \Delta_y)},$$

for $|\alpha| \le \pi/\Delta_x$ and $|\beta| \le \pi/\Delta_y$. The two-dimensional FFT produces MN 
values of $F_{DFT}(\alpha, \beta)$ on a rectangular grid of M equi-spaced values of $\alpha$ 
and N equi-spaced values of $\beta$. This calculation proceeds as follows. First, 
for each fixed value of n, an FFT of the M data points $\{f(m\Delta_x, n\Delta_y)\}$, 
m = 1, ..., M, is calculated, producing a function, say $G(\alpha_m, n\Delta_y)$, of M 
equi-spaced values of $\alpha$ and the N equi-spaced values $n\Delta_y$. Then, for each 
of the M equi-spaced values of $\alpha$, the FFT is applied to the N values 
$G(\alpha_m, n\Delta_y)$, n = 1, ..., N, to produce the final result. 
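
The two-pass procedure can be sketched as follows. This is an illustration only: a naive one-dimensional DFT stands in for the FFT, indices run from 0 rather than 1, the scale factor $\Delta_x \Delta_y$ is omitted, and all function names are hypothetical.

```python
import cmath


def dft(values):
    """Naive 1-D DFT with a positive exponent (assumed sign convention)."""
    size = len(values)
    return [
        sum(v * cmath.exp(2j * cmath.pi * n * k / size) for n, v in enumerate(values))
        for k in range(size)
    ]


def dft2_row_column(data):
    """data[m][n]: transform over m for each fixed n, then over n."""
    rows, cols = len(data), len(data[0])
    # Pass 1: for each fixed n, transform the M values data[0..M-1][n].
    g = [[0j] * cols for _ in range(rows)]
    for n in range(cols):
        column = dft([data[m][n] for m in range(rows)])
        for j in range(rows):
            g[j][n] = column[j]
    # Pass 2: for each of the M frequency indices j, transform the N values g[j][:].
    return [dft(g[j]) for j in range(rows)]


def dft2_direct(data):
    """Direct evaluation of the double sum, for comparison."""
    rows, cols = len(data), len(data[0])
    return [
        [
            sum(
                data[m][n] * cmath.exp(2j * cmath.pi * (m * j / rows + n * k / cols))
                for m in range(rows)
                for n in range(cols)
            )
            for k in range(cols)
        ]
        for j in range(rows)
    ]


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # hypothetical M = 2 by N = 3 data
fast = dft2_row_column(data)
slow = dft2_direct(data)
max_gap = max(abs(fast[j][k] - slow[j][k]) for j in range(2) for k in range(3))
```

The two results agree to rounding error, which is the point of the construction: the double sum factors into one-dimensional transforms.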


9.1 Chapter Summary 


We have seen how the Fourier transform arises naturally as we analyze 
the signals received in the far field from an array of transmitters or reflec- 
tors. In this chapter we describe the role played by the wave equation in 
remote sensing, focusing on plane-wave solutions. We shall consider this 
topic in more detail in Chapter 24. We restrict our attention here to single- 
frequency, or narrow-band, signals. We begin with a simple illustration of 
some of the issues we deal with in greater detail later in this chapter. 


9.2 The Bobbing Boats 


Imagine a large swimming pool in which there are several toy boats 
arrayed in a straight line. Although we use Figure 9.1 for a slightly different 
purpose elsewhere, for now we can imagine that the black dots in that figure 
represent our toy boats. Far across the pool, someone is slapping the water 
repeatedly, generating waves that proceed outward, in essentially concentric 
circles, across the pool. By the time the waves reach the boats, the circular 
shape has flattened out so that the wavefronts are essentially straight lines. 
The straight lines in Figure 9.1 can represent these wavefronts. 

As the wavefronts reach the boats, the boats bob up and down. If the 
lines of the wavefronts were oriented parallel to the line of the boats, then 
the boats would bob up and down in unison. When the wavefronts come 
in at some angle, as shown in the figure, the boats will bob up and down 
out of sync with one another, generally. By measuring the time it takes for 
the peak to travel from one boat to the next, we can estimate the angle of 
arrival of the wavefronts. This leads to two questions: 


1. Is it possible to get the boats to bob up and down in unison, even 
though the wavefronts arrive at an angle, as shown in the figure? 


2. Is it possible for wavefronts corresponding to two different angles of 
arrival to affect the boats in the same way, so that we cannot tell 
which of the two angles is the real one? 


We need a bit of mathematical notation. We let the distance from each 
boat to the ones on both sides be a constant distance $\Delta$. We assume that 
the water is slapped f times per second, so f is the frequency, in units of 
cycles per second. As the wavefronts move out across the pool, the distance 
from one peak to the next is called the wavelength, denoted $\lambda$. The product 
$\lambda f$ is the speed of propagation c; so $\lambda f = c$. As the frequency changes, so 
does the wavelength, while the speed of propagation, which depends solely 
on the depth of the pool, remains constant. The angle $\theta$ measures the tilt 
between the line of the wavefronts and the line of the boats, so that $\theta = 0$ 
indicates that these wavefront lines are parallel to the line of the boats, 
while $\theta = \frac{\pi}{2}$ indicates that the wavefront lines are perpendicular to the line 
of the boats. 
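
The remark about estimating the angle of arrival from the peak-to-peak travel time can be made concrete. The sketch below rests on a geometric assumption not spelled out in the text: a wavefront tilted by the angle theta must travel an extra distance of spacing times sin(theta) to get from one boat to the next. The function names and numerical values are hypothetical.

```python
import math


def delay_between_boats(theta: float, spacing: float, speed: float) -> float:
    """Time for a wavefront to travel from one boat to the next."""
    return spacing * math.sin(theta) / speed


def angle_from_delay(delay: float, spacing: float, speed: float) -> float:
    """Invert the relation above; requires |speed * delay| <= spacing."""
    return math.asin(speed * delay / spacing)


spacing = 0.5                 # metres between neighbouring boats (made up)
speed = 1.2                   # propagation speed in metres per second (made up)
theta = math.radians(25.0)    # true tilt of the wavefronts

tau = delay_between_boats(theta, spacing, speed)
print(math.degrees(angle_from_delay(tau, spacing, speed)))
```

When theta = 0 the delay is zero and the boats move together, in agreement with the description above.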

FIGURE 9.1: A uniform line array sensing a plane-wave field. 


Ex. 9.1 Let the angle $\theta$ be arbitrary, but fixed, and let $\Delta$ be fixed. Can we 
select the frequency f in such a way that we can make all the boats bob up 
and down in unison? 


Ex. 9.2 Suppose now that the frequency f is fixed, but we are free to alter 
the spacing $\Delta$. Can we choose $\Delta$ so that we can always determine the true 
angle of arrival? 


9.3 Transmission and Remote Sensing 


For pedagogical reasons, we shall discuss separately what we shall call 
the transmission and the remote-sensing problems, although the two prob- 
lems are opposite sides of the same coin, in a sense. In the one-dimensional 
transmission problem, it is convenient to imagine the transmitters located 
at points (x, 0) within a bounded interval [-A, A] of the x-axis, and the 
measurements taken at points P lying on a circle of radius D, centered at 
the origin. The radius D is large, with respect to A. It may well be the 
case that no actual sensing is to be performed, but rather, we are simply 
interested in what the received signal pattern is at points P distant from 
the transmitters. Such would be the case, for example, if we were analyzing 
or constructing a transmission pattern of radio broadcasts. In the remote- 
sensing problem, in contrast, we imagine, in the one-dimensional case, that 
our sensors occupy a bounded interval of the x-axis, and the transmitters 
or reflectors are points of a circle whose radius is large, with respect to 
the size of the bounded interval. The actual size of the radius does not 
matter and we are interested in determining the amplitudes of the trans- 
mitted or reflected signals, as a function of angle only. Such is the case 
in astronomy, far-field sonar or radar, and the like. Both the transmission 
and remote-sensing problems illustrate the important role played by the 
Fourier transform. 


9.4 The Transmission Problem 


We identify two distinct transmission problems: the direct problem and 
the inverse problem. In the direct transmission problem, we wish to deter- 
mine the far-field pattern, given the complex amplitudes of the transmitted 
signals. In the inverse transmission problem, the array of transmitters or 
reflectors is the object of interest; we are given, or we measure, the far-field 
pattern and wish to determine the amplitudes. For simplicity, we consider 
only single-frequency signals. 

We suppose that each point x in the interval [-A, A] transmits the 
signal $f(x)e^{i\omega t}$, where f(x) is the complex amplitude of the signal and 
$\omega > 0$ is the common fixed frequency of the signals. Let D > 0 be large, with 
respect to A, and consider the signal received at each point P given in polar 
coordinates by $P = (D, \theta)$. The distance from (x, 0) to P is approximately 
$D - x\cos\theta$, so that, at time t, the point P receives from (x, 0) the signal 
$f(x)e^{i\omega(t - (D - x\cos\theta)/c)}$, where c is the propagation speed. Therefore, the 
combined signal received at P is 

$$B(P, t) = e^{i\omega t} e^{-i\omega D/c} \int_{-A}^{A} f(x)\, e^{i x \frac{\omega\cos\theta}{c}}\, dx.$$

The integral term, which gives the far-field pattern of the transmission, is 

$$F\left(\frac{\omega\cos\theta}{c}\right) = \int_{-A}^{A} f(x)\, e^{i x \frac{\omega\cos\theta}{c}}\, dx,$$

where $F(\gamma)$ is the Fourier transform of f(x), given by 

$$F(\gamma) = \int_{-A}^{A} f(x)\, e^{i x \gamma}\, dx.$$

How $F\left(\frac{\omega\cos\theta}{c}\right)$ behaves, as a function of $\theta$, as we change A and $\omega$, is 
discussed in some detail in the chapter on direct transmission. 

Consider, for example, the function f(x) = 1, for $|x| \le A$, and f(x) = 0, 
otherwise. The Fourier transform of f(x) is 

$$F(\gamma) = 2A\,\mathrm{sinc}(A\gamma),$$

where sinc(t) is defined to be 

$$\mathrm{sinc}(t) = \frac{\sin(t)}{t},$$

for $t \ne 0$, and sinc(0) = 1. Then $F(\omega\cos\theta/c) = 2A$ when $\cos\theta = 0$, so when 
$\theta = \pi/2$ and $\theta = 3\pi/2$. We will have $F(\omega\cos\theta/c) = 0$ when $A\omega\cos\theta/c = \pi$, or 
$\cos\theta = \pi c/(A\omega)$. Therefore, the transmission pattern has no nulls if $\pi c/(A\omega) > 1$. 
In order for the transmission pattern to have nulls, we need $A \ge \lambda/2$, where 
$\lambda = 2\pi c/\omega$ is the wavelength. This rather counterintuitive fact, namely that we 
need more signals transmitted in order to receive less at certain locations, 
illustrates the phenomenon of destructive interference. 
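
The null condition can be explored numerically. The sketch below is an illustration with hypothetical function names; it uses $\omega/c = 2\pi/\lambda$, so that the condition $\cos\theta = \pi c/(A\omega)$ becomes $\cos\theta = \lambda/(2A)$.

```python
import math


def sinc(t: float) -> float:
    """Unnormalized sinc used in the text: sin(t)/t, with sinc(0) = 1."""
    return 1.0 if t == 0.0 else math.sin(t) / t


def far_field(theta: float, half_width: float, wavelength: float) -> float:
    """2A sinc(A * omega * cos(theta) / c), with omega / c = 2 pi / wavelength."""
    gamma = 2.0 * math.pi * math.cos(theta) / wavelength
    return 2.0 * half_width * sinc(half_width * gamma)


def first_null_angle(half_width: float, wavelength: float):
    """Angle with cos(theta) = wavelength / (2A), or None when there is no null."""
    ratio = wavelength / (2.0 * half_width)
    return math.acos(ratio) if ratio <= 1.0 else None


# A = 1 wavelength: a null appears where cos(theta) = 1/2.
print(first_null_angle(1.0, 1.0))
# A = 1/4 wavelength: the aperture is too small for any null.
print(first_null_angle(0.25, 1.0))
```

Note that this sinc is not the normalized form sin(pi t)/(pi t) found in some numerical libraries.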


9.5 Reciprocity 


For certain remote-sensing applications, such as sonar and radar array 
processing and astronomy, it is convenient to switch the roles of sender 
and receiver. Imagine that superimposed plane-wave fields are sensed at 
points within some bounded region of the interior of the sphere, having 
been transmitted or reflected from the points P on the surface of a sphere 
whose radius D is large with respect to the bounded region. The reciprocity 
principle tells us that the same mathematical relation holds between points 
P and (x, 0), regardless of which is the sender and which the receiver. 
Consequently, the data obtained at the points (x,0) are then values of the 
inverse Fourier transform of the function describing the amplitude of the 
signal sent from each point P. 


9.6 Remote Sensing 


A basic problem in remote sensing is to determine the nature of a distant 
object by measuring signals transmitted by or reflected from that object. 
If the object of interest is sufficiently remote, that is, is in the far field, the 
data we obtain by sampling the propagating spatio-temporal field is related, 
approximately, to what we want by Fourier transformation. The problem 
is then to estimate a function from finitely many (usually noisy) values 
of its Fourier transform. The application we consider here is a common 
one of remote-sensing of transmitted or reflected waves propagating from 
distant sources. Examples include optical imaging of planets and asteroids 
using reflected sunlight, radio-astronomy imaging of distant sources of radio 
waves, active and passive sonar, and radar imaging. 


9.7 The Wave Equation 


In many areas of remote sensing, what we measure are the fluctuations 
in time of an electromagnetic or acoustic field. Such fields are described 
mathematically as solutions of certain partial differential equations, such 
as the wave equation. A function u(x, y, z, t) is said to satisfy the 
three-dimensional wave equation if 

$$u_{tt} = c^2 (u_{xx} + u_{yy} + u_{zz}) = c^2 \nabla^2 u,$$

where $u_{tt}$ denotes the second partial derivative of u with respect to the time 
variable t and c > 0 is the (constant) speed of propagation. More 
complicated versions of the wave equation permit the speed of propagation 
c to vary with the spatial variables x, y, z, but we shall not consider that 
here. 

We use the method of separation of variables at this point, to get some 
idea about the nature of solutions of the wave equation. Assume, for the 
moment, that the solution u(t, x, y, z) has the simple form 

$$u(t, x, y, z) = g(t) f(x, y, z).$$

Inserting this separated form into the wave equation, we get 

$$g''(t) f(x, y, z) = c^2 g(t) \nabla^2 f(x, y, z)$$

or 

$$g''(t)/g(t) = c^2 \nabla^2 f(x, y, z)/f(x, y, z).$$

The function on the left is independent of the spatial variables, while the 
one on the right is independent of the time variable; consequently, they 
must both equal the same constant, which we denote $-\omega^2$. From this we 
have two separate equations, 

$$g''(t) + \omega^2 g(t) = 0, \qquad (9.1)$$

and 

$$\nabla^2 f(x, y, z) + \frac{\omega^2}{c^2} f(x, y, z) = 0. \qquad (9.2)$$

Equation (9.2) is the Helmholtz equation. 

Equation (9.1) has for its solutions the functions $g(t) = \cos(\omega t)$ and 
$\sin(\omega t)$, or, in complex form, the complex exponential functions $g(t) = e^{i\omega t}$ 
and $g(t) = e^{-i\omega t}$. Functions u(t, x, y, z) = g(t) f(x, y, z) with such time 
dependence are called time-harmonic solutions. 
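
A numerical sanity check of a time-harmonic solution may be helpful; it is no substitute for the derivation. The sketch below assumes a particular choice, $g(t) = \cos(\omega t)$ and $f(x, y, z) = \sin(k_1 x + k_2 y + k_3 z)$ with $\omega^2 = c^2(k_1^2 + k_2^2 + k_3^2)$, and uses central differences to estimate the second derivatives. All names and numerical values are hypothetical.

```python
import math

c = 2.0                    # propagation speed (made-up value)
kvec = (1.0, 2.0, 2.0)     # made-up spatial frequencies, with length 3
omega = c * math.sqrt(sum(ki * ki for ki in kvec))  # enforces omega^2 = c^2 |k|^2


def u(t, x, y, z):
    """Separated form g(t) f(x, y, z) with g = cos(omega t), f = sin(k . s)."""
    return math.cos(omega * t) * math.sin(kvec[0] * x + kvec[1] * y + kvec[2] * z)


def second_difference(func, point, axis, h=1e-3):
    """Central-difference estimate of the second partial derivative."""
    plus = list(point)
    minus = list(point)
    plus[axis] += h
    minus[axis] -= h
    return (func(*plus) - 2.0 * func(*point) + func(*minus)) / (h * h)


point = (0.3, 0.1, -0.4, 0.7)   # (t, x, y, z)
u_tt = second_difference(u, point, 0)
laplacian = sum(second_difference(u, point, axis) for axis in (1, 2, 3))
residual = u_tt - c * c * laplacian   # should be near zero, up to O(h^2) error
print(u_tt, c * c * laplacian, residual)
```

Here the spatial factor satisfies Equation (9.2), since each second derivative of f brings down a factor $-k_j^2$.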

In three-dimensional spherical coordinates with $r = \sqrt{x^2 + y^2 + z^2}$, a 
radial function u(r, t) satisfies the wave equation if 

$$u_{tt} = c^2 \left( u_{rr} + \frac{2}{r} u_r \right).$$


Ex. 9.3 Show that the radial function $u(r, t) = \frac{1}{r} h(r - ct)$ satisfies the wave 
equation for any twice differentiable function h. 


9.8 Plane-Wave Solutions 


Suppose that, beginning at time t = 0, there is a localized disturbance. 
As time passes, that disturbance spreads out spherically. When the radius 
of the sphere is very large, the surface of the sphere appears planar, to 
an observer on that surface, who is said then to be in the far field. This 
motivates the study of solutions of the wave equation that are constant on 
planes; the so-called plane-wave solutions. 


Ex. 9.4 Let s = (x, y, z) and $u(s, t) = u(x, y, z, t) = e^{i\omega t} e^{i k \cdot s}$. Show that u 
satisfies the wave equation $u_{tt} = c^2 \nabla^2 u$ for any real vector k, so long as 
$\|k\|^2 = \omega^2/c^2$. This solution is a plane wave associated with frequency $\omega$ 
and wavevector k; at any fixed time the function u(s, t) is constant on any 
plane in three-dimensional space having k as a normal vector. 


In radar and sonar, the field u(s, t) being sampled is usually viewed as 
a discrete or continuous superposition of plane-wave solutions with various 
amplitudes, frequencies, and wavevectors. We sample the field at various 
spatial locations s, for various times t. Here we simplify the situation a 
bit by assuming that all the plane-wave solutions are associated with the 
same frequency, $\omega$. If not, we can perform an FFT on the functions of time 
received at each sensor location s and keep only the value associated with 
the desired frequency $\omega$. 
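
The step of keeping only one frequency can be sketched as follows. This is an illustration under assumptions: the record at a single sensor is made up, the desired frequency is assumed to fall exactly on a DFT bin, and only that one coefficient is computed directly, where an FFT would return all of them.

```python
import cmath


def frequency_component(samples, bin_index):
    """One DFT coefficient: (1/N) sum_n samples[n] exp(-2*pi*i*n*bin_index/N)."""
    size = len(samples)
    return sum(
        s * cmath.exp(-2j * cmath.pi * n * bin_index / size)
        for n, s in enumerate(samples)
    ) / size


size = 64
# Hypothetical time record at one sensor: amplitude 0.8 at bin 5 (the
# frequency of interest) plus an unwanted amplitude 0.3 component at bin 12.
record = [
    0.8 * cmath.exp(2j * cmath.pi * 5 * n / size)
    + 0.3 * cmath.exp(2j * cmath.pi * 12 * n / size)
    for n in range(size)
]

kept = frequency_component(record, 5)   # complex amplitude at the desired frequency
print(abs(kept))
```

Repeating this at every sensor location yields one complex number per sensor, which is the single-frequency data used in what follows.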


9.9 Superposition and the Fourier Transform 


In the continuous superposition model, the field is a superposition of 
plane-wave solutions 

$$u(s, t) = e^{i\omega t} \int F(k)\, e^{i k \cdot s}\, dk.$$

Our measurements at the sensor locations s give us the values 

$$f(s) = \int F(k)\, e^{i k \cdot s}\, dk. \qquad (9.3)$$

The data are then Fourier transform values of the complex function F(k); 
F(k) is defined for all three-dimensional real vectors k, but is zero, in 
theory, at least, for those k whose squared length $\|k\|^2$ is not equal to $\omega^2/c^2$. 
Our goal is then to estimate F(k) from measured values of its Fourier 
transform. Since each k is a normal vector for its plane-wave field component, 
determining the value of F(k) will tell us the strength of the plane-wave 
component coming from the direction k. 


9.9.1 The Spherical Model 


We can imagine that the sources of the plane-wave fields are the points 
P that lie on the surface of a large sphere centered at the origin. For each 
P, the ray from the origin to P is parallel to some wavevector k. The 
function F(k) can then be viewed as a function F(P) of the points P. Our 
measurements will be taken at points s inside this sphere. The radius of 
the sphere is assumed to be orders of magnitude larger than the distance 
between sensors. The situation is that of astronomical observation of the 
heavens using ground-based antennas. The sources of the optical or electro- 
magnetic signals reaching the antennas are viewed as lying on a large sphere 
surrounding the earth. Distance to the sources is not considered now, and 
all we are interested in are the amplitudes F(k) of the fields associated 
with each direction k. 


9.10 Sensor Arrays 


In some applications the sensor locations are essentially arbitrary, while 
in others their locations are carefully chosen. Sometimes, the sensors are 
collinear, as in sonar towed arrays. Figure 9.1 illustrates a line array. 


9.10.1 The Two-Dimensional Array 


Suppose now that the sensors are in locations s = (x, y, 0), for various 
x and y; then we have a planar array of sensors. Then the dot product $s \cdot k$ 
that occurs in Equation (9.3) is 

$$s \cdot k = x k_1 + y k_2;$$

we cannot see the third component, $k_3$. However, since we know the size 
of the vector k, we can determine $|k_3|$. The only ambiguity that remains 
is that we cannot distinguish sources on the upper hemisphere from those 
on the lower one. In most cases, such as astronomy, it is obvious in which 
hemisphere the sources lie, so the ambiguity is resolved. 

The function F(k) can then be viewed as $F(k_1, k_2)$, a function of the 
two variables $k_1$ and $k_2$. Our measurements give us values of f(x, y), the 
two-dimensional Fourier transform of $F(k_1, k_2)$. Because of the limitation 
$\|k\| = \omega/c$, the function $F(k_1, k_2)$ has bounded support. Consequently, its 
Fourier transform cannot have bounded support. As a result, we can never 
have all the values of f(x, y), and so cannot hope to reconstruct $F(k_1, k_2)$ 
exactly, even for noise-free data. 


9.10.2 The One-Dimensional Array 


If the sensors are located at points s having the form s = (x, 0, 0), then 
we have a line array of sensors. The dot product in Equation (9.3) becomes 

$$s \cdot k = x k_1.$$

Now the ambiguity is greater than in the planar array case. Once we have 
$k_1$, we know that 

$$k_2^2 + k_3^2 = \frac{\omega^2}{c^2} - k_1^2,$$

which describes points P lying on a circle on the surface of the distant 
sphere, with the vector $(k_1, 0, 0)$ pointing at the center of the circle. It 
is said then that we have a cone of ambiguity. One way to resolve the 
situation is to assume $k_3 = 0$; then $|k_2|$ can be determined and we have 
remaining only the ambiguity involving the sign of $k_2$. Once again, in many 
applications, this remaining ambiguity can be resolved by other means. 

Once we have resolved any ambiguity, we can view the function F(k) as 
$F(k_1)$, a function of the single variable $k_1$. Our measurements give us values 
of f(x), the Fourier transform of $F(k_1)$. As in the two-dimensional case, the 
restriction on the size of the vectors k means that the function $F(k_1)$ has 
bounded support. Consequently, its Fourier transform, f(x), cannot have 
bounded support. Therefore, we shall never have all of f(x), and so cannot 
hope to reconstruct $F(k_1)$ exactly, even for noise-free data. 


9.10.3 Limited Aperture 


In both the one- and two-dimensional problems, the sensors will be 
placed within some bounded region, such as $|x| \le A$, $|y| \le B$ for the 
two-dimensional problem, or $|x| \le A$ for the one-dimensional case. These 
bounded regions are the apertures of the arrays. The larger these apertures 
are, in units of the wavelength, the better the resolution of the 
reconstructions. 

In digital array processing there are only finitely many sensors, which 
then places added limitations on our ability to reconstruct the field 
amplitude function F(k). 


9.11 Sampling 


In the one-dimensional case, the signal received at the point (x, 0, 0) 
is essentially the inverse Fourier transform f(x) of the function $F(k_1)$; for 
notational simplicity, we write $k = k_1$. The function F(k) is supported on the 
bounded interval $|k| \le \omega/c$, so f(x) cannot have bounded support. As we noted 
earlier, to determine F(k) exactly, we would need measurements of f(x) on an 
unbounded set. But, which unbounded set? 

Because the function F(k) is zero outside the interval $[-\omega/c, \omega/c]$, the 
function f(x) is band-limited. The Nyquist spacing in the variable x is therefore 

$$\Delta_x = \frac{\pi c}{\omega}.$$

The wavelength $\lambda$ associated with the frequency $\omega$ is defined to be 

$$\lambda = \frac{2\pi c}{\omega},$$

so that 

$$\Delta_x = \frac{\lambda}{2}.$$

The significance of the Nyquist spacing comes from Shannon’s Sampling 
Theorem, which says that if we have the values $f(m\Delta_x)$, for all integers m, 
then we have enough information to recover F(k) exactly. In practice, of 
course, this is never the case. 


9.12 The Limited-Aperture Problem 


In the remote-sensing problem, our measurements at points (x, 0, 0) in 
the far field give us the values f(x). Suppose now that we are able to take 
measurements only for limited values of x, say for $|x| \le A$; then 2A is the 
aperture of our antenna or array of sensors. We describe this by saying that 
we have available measurements of f(x)h(x), where $h(x) = \chi_A(x) = 1$, 
for $|x| \le A$, and zero otherwise. So, in addition to describing blurring and 
low-pass filtering, the convolution-filter model can also be used to model 
the limited-aperture problem. As in the low-pass case, the limited-aperture 
problem can be attacked using extrapolation, but with the same sort of risks 
described for the low-pass case. A much different approach is to increase the 
aperture by physically moving the array of sensors, as in synthetic aperture 
radar (SAR). 




Returning to the far-field remote-sensing model, if we have Fourier 
transform data only for $|x| \le A$, then we have f(x) for $|x| \le A$. Using 
$h(x) = \chi_A(x)$ to describe the limited aperture of the system, the 
point-spread function is $H(\gamma) = 2A\,\mathrm{sinc}(\gamma A)$, the Fourier transform of h(x). The 
first zeros of the numerator occur at $|\gamma| = \pi/A$, so the main lobe of the 
point-spread function has width $2\pi/A$. For this reason, the resolution of such 
a limited-aperture imaging system is said to be on the order of 1/A. Since 
$|k| \le \omega/c$, we can write $k = \frac{\omega}{c}\sin\theta$, where $\theta$ denotes the angle between the 
positive y-axis and the vector $k = (k_1, k_2, 0)$; that is, $\theta$ points in the 
direction of the point P associated with the wavevector k. The resolution, as 
measured by the width of the main lobe of the point-spread function $H(\gamma)$, 
in units of k, is $2\pi/A$, but the angular resolution will depend also on the 
frequency $\omega$. Since $k = \frac{2\pi}{\lambda}\sin\theta$, a distance of one unit in k may correspond 
to a large change in $\theta$ when $\omega$ is small, but only to a relatively small change 
in $\theta$ when $\omega$ is large. For this reason, the aperture of the array is usually 
measured in units of the wavelength; an aperture of A = 5 meters may be 
acceptable if the frequency is high, so that the wavelength is small, but not 
if the radiation is in the one-meter-wavelength range. 


9.13 Resolution 


If $F(k) = \delta(k)$ and $h(x) = \chi_A(x)$ describes the aperture-limitation of 
the imaging system, then the point-spread function is $H(\gamma) = 2A\,\mathrm{sinc}(\gamma A)$. 
The maximum of $H(\gamma)$ still occurs at $\gamma = 0$, but the main lobe of $H(\gamma)$ 
extends from $-\pi/A$ to $\pi/A$; the point source has been spread out. If the 
point-source object shifts, so that $F(k) = \delta(k - a)$, then the reconstructed image 
of the object is H(k - a), so the peak is still in the proper place. If we know 
a priori that the object is a single point source, but we do not know its 
location, the spreading of the point poses no problem; we simply look for 
the maximum in the reconstructed image. Problems arise when the object 
contains several point sources, or when we do not know a priori what we 
are looking at, or when the object contains no point sources, but is just a 
continuous distribution. 

Suppose that $F(k) = \delta(k - a) + \delta(k - b)$; that is, the object consists 
of two point sources. Then Fourier transformation of the aperture-limited 
data leads to the reconstructed image 

$$R(k) = 2A\left(\mathrm{sinc}(A(k - a)) + \mathrm{sinc}(A(k - b))\right).$$

If |b - a| is large enough, R(k) will have two distinct maxima, at 
approximately k = a and k = b, respectively. For this to happen, we need $\pi/A$, 
half the width of the main lobe of the function sinc(Ak), to be less than 
|b - a|. In other words, to resolve the two point sources a distance |b - a| 
apart, we need $A > \pi/|b - a|$. However, if |b - a| is too small, the distinct 
maxima merge into one, at $k = \frac{a+b}{2}$, and resolution will be lost. How small 
is too small will depend on both A and $\omega$. 
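
The merging of the two maxima is easy to observe numerically. The sketch below evaluates R(k) for two apertures; the test used to decide whether two peaks are visible (a dip at the midpoint below the values at both sources) is a simple heuristic chosen for illustration, not a criterion taken from the text, and the function names are hypothetical.

```python
import math


def sinc(t: float) -> float:
    """Unnormalized sinc: sin(t)/t, with sinc(0) = 1."""
    return 1.0 if t == 0.0 else math.sin(t) / t


def reconstructed_image(k, half_width, a, b):
    """R(k) = 2A (sinc(A(k - a)) + sinc(A(k - b)))."""
    return 2.0 * half_width * (
        sinc(half_width * (k - a)) + sinc(half_width * (k - b))
    )


def shows_two_peaks(half_width, a, b):
    """Heuristic: R at the midpoint lies below R at both source locations."""
    mid = reconstructed_image(0.5 * (a + b), half_width, a, b)
    at_a = reconstructed_image(a, half_width, a, b)
    at_b = reconstructed_image(b, half_width, a, b)
    return mid < at_a and mid < at_b


a, b = -1.0, 1.0   # two point sources a distance 2 apart (made-up values)
print(shows_two_peaks(10.0, a, b))   # pi/A is about 0.31, well below |b - a|
print(shows_two_peaks(0.5, a, b))    # pi/A is about 6.3, well above |b - a|
```

With A = 10 the image dips between the sources; with A = 0.5 the largest value sits at the midpoint and the two sources appear as one.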

Suppose now that $F(k) = \delta(k - a)$, but we do not know a priori that 
the object is a single point source. We calculate 

$$R(k) = H(k - a) = 2A\,\mathrm{sinc}(A(k - a))$$

and use this function as our reconstructed image of the object, for all k. 
What we see when we look at R(k) for some $k = b \ne a$ is R(b), which is 
the same thing we see when the point source is at k = b and we look at 
k = a. Point-spreading is, therefore, more than a cosmetic problem. When 
the object is a point source at k = a, but we do not know a priori that it 
is a point source, the spreading of the point causes us to believe that the 
object function F(k) is nonzero at values of k other than k = a. When we 
look at, say, k = b, we see a nonzero value that is caused by the presence 
of the point source at k = a. 

Assume now that the object function F(k) contains no point sources, 
but is simply an ordinary function of k. If the aperture A is very small, then 
the function H(k) is nearly constant over the entire extent of the object. 
The convolution of F(k) and H(k) is essentially the integral of F(k), so 
the reconstructed object is $R(k) = \int F(k)\,dk$, for all k. Let’s see what this 
means for the solar-emission problem discussed earlier. 


9.13.1 The Solar-Emission Problem Revisited 


The wavelength of the radiation is $\lambda = 1$ meter. Therefore, $\omega/c = 2\pi$, 
and k in the interval $[-2\pi, 2\pi]$ corresponds to the angle $\theta$ in $[0, \pi]$. The sun 
has an angular diameter of 30 minutes of arc, which is about $10^{-2}$ radians. 
Therefore, the sun subtends the angles $\theta$ in $[\frac{\pi}{2} - (0.5)\cdot 10^{-2}, \frac{\pi}{2} + (0.5)\cdot 10^{-2}]$, 
which corresponds roughly to the variable k in the interval $[-3\cdot 10^{-2}, 3\cdot 10^{-2}]$. 
Resolution of 3 minutes of arc means resolution in the variable k of 
$3\cdot 10^{-3}$. If the aperture is 2A, then to achieve this resolution, we need 

$$\frac{\pi}{A} \le 3\cdot 10^{-3},$$

or 

$$A \ge \frac{\pi}{3}\cdot 10^{3}$$

meters, or A not less than about 1000 meters. 
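
The arithmetic above can be reproduced in a few lines. The sketch uses the rounded figures quoted in the text ($10^{-2}$ radians for the sun's diameter and $3\cdot 10^{-3}$ for the required resolution in k); the variable names are hypothetical.

```python
import math

wavelength = 1.0                              # metres, as in the text
omega_over_c = 2.0 * math.pi / wavelength     # equals 2 pi

arc_minute = math.pi / (180.0 * 60.0)         # radians in one minute of arc
sun_diameter = 30.0 * arc_minute              # "about 10^-2 radians" in the text

# Half of the rounded angular diameter, converted to the variable k.
k_half_range = omega_over_c * (0.5 * 1e-2)

k_resolution = 3e-3                           # the text's figure for 3 minutes of arc
min_half_aperture = math.pi / k_resolution    # from pi / A <= 3e-3

print(sun_diameter, k_half_range, min_half_aperture)
```

The bound comes out a little over one kilometer, which the text rounds to 1000 meters.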

The radio-wave signals emitted by the sun are focused, using a parabolic 
radio-telescope. The telescope is pointed at the center of the sun. Because 
the sun is a great distance from the earth and the subtended arc is small 
(30 min.), the signals from each point on the sun’s surface arrive at the 
parabola nearly head-on, that is, parallel to the line from the vertex to the 
focal point, and are reflected to the receiver located at the focal point of 
the parabola. The effect of the parabolic antenna is not to discriminate 
against signals coming from other directions, since there are none, but to 
effect a summation of the signals received at points (x, 0, 0), for $|x| \le A$, 
where 2A is the diameter of the parabola. When the aperture is large, the 
function h(x) is nearly one for all x and the signal received at the focal 
point is essentially 

$$\int f(x)\,dx = F(0);$$

we are now able to distinguish between F(0) and other values F(k). When 
the aperture is small, h(x) is essentially $\delta(x)$ and the signal received at the 
focal point is essentially 

$$\int f(x)\,\delta(x)\,dx = f(0) = \int F(k)\,dk;$$

now all we get is the contribution from all the k, superimposed, and all 
resolution is lost. 

Since the solar emission problem is clearly two-dimensional, and we need 
3 min. resolution in both dimensions, it would seem that we would need a 
circular antenna with a diameter of about one kilometer, or a rectangular 
antenna roughly one kilometer on a side. Eventually, this problem was 
solved by converting it into essentially a tomography problem and applying 
the same techniques that are today used in CAT scan imaging. 


9.13.2 Other Limitations on Resolution 


In imaging regions of the earth from satellites in orbit there is a trade-off 
between resolution and the time available to image a given site. Satellites in 
geostationary orbit, such as weather and TV satellites, remain stationary, 
relative to a fixed position on the earth’s surface, but to do so must orbit 
22,000 miles above the earth. If we tried to image the earth from that 
height, a telescope like the Hubble Space Telescope would have a resolution 
of about 21 feet, due to the unavoidable blurring caused by the optics of 
the lens itself. The Hubble orbits 353 miles above the earth, but because 
it looks out into space, not down to earth, it only needs to be high enough 
to avoid atmospheric distortions. Spy satellites operate in low Earth orbit 
(LEO), about 200 miles above the earth, and achieve a resolution of about 
2 or 3 inches, at the cost of spending only about 1 or 2 minutes over their 
target. The satellites used in the GPS system maintain a medium Earth 
orbit (MEO) at a height of about 12,000 miles, high enough to be seen 
over the horizon most of the time, but not so high as to require great 
power to send their signals. 

In the February 2003 issue of Harper’s Magazine there is an article on 
“scientific apocalypse” dealing with the search for near-earth asteroids. 
These objects are initially detected by passive optical observation, as small 
dots of reflected sunlight; once detected, they are then imaged by active 
radar to determine their size, shape, rotation and such. Some Russian as- 
tronomers are concerned about the near-earth asteroid Apophis 2004 MN4, 
which, they say, will pass within 30,000 km of earth in 2029, and come even 
closer in 2036. This is closer to earth than the satellites in geostationary 
orbit. As they say, “Stay tuned for further developments.” 


9.14 Discrete Data 


A familiar topic in signal processing is the passage from functions of 
continuous variables to discrete sequences. This transition is achieved by 
sampling, that is, extracting values of the continuous-variable function at 
discrete points in its domain. Our example of far-field propagation can be 
used to explore some of the issues involved in sampling. 

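Imagine an infinite uniform line array of sensors formed by placing 
receivers at the points $(n\Delta, 0, 0)$, for some $\Delta > 0$ and all integers n. Then 
our data are the values $f(n\Delta)$. Because we defined $k = \frac{\omega}{c}\cos\theta$, it is clear 
that the function F(k) is zero for k outside the interval $[-\omega/c, \omega/c]$. 

Our discrete array of sensors cannot distinguish between the signal 
arriving from $\theta$ and a signal with the same amplitude, coming from an angle 
$\alpha$ with 

$$\frac{\omega}{c}\cos\alpha = \frac{\omega}{c}\cos\theta + \frac{2\pi}{\Delta} m,$$

where m is an integer. To avoid this ambiguity, we select $\Delta > 0$ so that 

$$-\frac{\omega}{c} + \frac{2\pi}{\Delta} \ge \frac{\omega}{c},$$

or 

$$\Delta \le \frac{\pi c}{\omega} = \frac{\lambda}{2}.$$

The sensor spacing $\Delta_s = \lambda/2$ is the Nyquist spacing. 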
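
The ambiguity relation can be turned into a small search for the confusable angles. The sketch below rewrites the condition, using $\omega/c = 2\pi/\lambda$, as $\cos\alpha = \cos\theta + m\lambda/\Delta$; the function name and the cutoff `max_order` on the integer m are hypothetical choices made for illustration.

```python
import math


def ambiguous_angles(theta, spacing, wavelength, max_order=5):
    """Angles alpha in [0, pi] with cos(alpha) = cos(theta) + m*wavelength/spacing, m != 0."""
    found = []
    for m in range(-max_order, max_order + 1):
        if m == 0:
            continue  # m = 0 is the true angle itself
        value = math.cos(theta) + m * wavelength / spacing
        if -1.0 <= value <= 1.0:
            found.append(math.acos(value))
    return found


theta = math.pi / 3   # true angle of arrival, 60 degrees (made-up value)

# Spacing of one full wavelength: the angle 120 degrees is indistinguishable.
print([math.degrees(alpha) for alpha in ambiguous_angles(theta, 1.0, 1.0)])

# Spacing of half a wavelength: no other angle produces the same data.
print(ambiguous_angles(theta, 0.5, 1.0))
```

At exactly half-wavelength spacing the only remaining coincidence is between the two end-on directions, where $\cos\theta = \pm 1$.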

In the sunspot example, the object function F(k) is zero for k outside of 
an interval much smaller than $[-\omega/c, \omega/c]$. Knowing that F(k) = 0 for |k| > K, 
for some $0 < K < \omega/c$, we can accept ambiguities that confuse $\theta$ with another 
angle that lies outside the angular diameter of the object. Consequently, 
we can redefine the Nyquist spacing to be 

$$\Delta_s = \frac{\pi}{K}.$$

This tells us that when we are imaging a distant object with a small angular 
diameter, the Nyquist spacing is greater than $\lambda/2$. If our sensor spacing has 
been chosen to be $\lambda/2$, then we have oversampled. In the oversampled case, 
band-limited extrapolation methods can be used to improve resolution. 


9.14.1 Reconstruction from Samples 


From the data gathered at our infinite array we have extracted the 
Fourier transform values $f(n\Delta)$, for all integers n. The obvious question is 
whether or not the data is sufficient to reconstruct F(k). We know that, to 
avoid ambiguity, we must have $\Delta \le \pi c/\omega$. The good news is that, provided 
this condition holds, F(k) is uniquely determined by this data and formulas 
exist for reconstructing F(k) from the data; this is the content of 
Shannon’s Sampling Theorem. Of course, this is only of theoretical interest, 
since we never have infinite data. Nevertheless, a considerable amount of 
traditional signal-processing exposition makes use of this infinite-sequence 
model. The real problem is that our data is always finite. 


9.15 The Finite-Data Problem 


Suppose that we build a uniform line array of sensors by placing 
receivers at the points $(n\Delta, 0, 0)$, for some $\Delta > 0$ and n = -N, ..., N. Then 
our data are the values $f(n\Delta)$, for n = -N, ..., N. Suppose, as previously, 
that the object of interest, the function F(k), is nonzero only for values of 
k in the interval [-K, K], for some $0 < K < \omega/c$. Once again, we must have 
$\Delta \le \pi c/\omega$ to avoid ambiguity; but this is not enough, now. The finite Fourier 
data is no longer sufficient to determine a unique F(k). The best we can 
hope to do is to estimate the true F(k), using both our measured Fourier 
data and whatever prior knowledge we may have about the function F(k), 
such as where it is nonzero, if it consists of Dirac delta point sources, or if 
it is nonnegative. The data is also noisy, and that must be accounted for 
in the reconstruction process. 

In certain applications, such as sonar array processing, the sensors are 
not necessarily arrayed at equal intervals along a line, or even at the grid 
points of a rectangle, but in an essentially arbitrary pattern in two, or even 
three, dimensions. In such cases, we have values of the Fourier transform 
of the object function, but at essentially arbitrary values of the variable. 
How best to reconstruct the object function in such cases is not obvious. 
9.16 Functions of Several Variables


Fourier transformation applies, as well, to functions of several variables. 
As in the one-dimensional case, we can motivate the multi-dimensional 
Fourier transform using the far-field propagation model. As we noted ear- 
lier, the solar emission problem is inherently a two-dimensional problem. 


9.16.1 A Two-Dimensional Far-Field Object 


Assume that our sensors are located at points $s = (x,y,0)$ in the
$x,y$-plane. As discussed previously, we assume that the function $F(k)$
can be viewed as a function $F(k_1,k_2)$. Since, in most applications, the
distant object has a small angular diameter when viewed from a great
distance (the sun's is only 30 minutes of arc), the function $F(k_1,k_2)$
will be supported on a small subset of vectors $(k_1,k_2)$.


9.16.2 Limited Apertures in Two Dimensions 


Suppose we have the values of the Fourier transform, $f(x,y)$, for
$|x| \le A$ and $|y| \le B$. We describe this limited-data problem using
the function $h(x,y)$ that is one for $|x| \le A$ and $|y| \le B$, and
zero, otherwise. Then the point-spread function is the Fourier transform
of this $h(x,y)$, given by

$$H(\alpha,\beta) = 4AB\,\mathrm{sinc}(A\alpha)\,\mathrm{sinc}(B\beta).$$

The resolution in the horizontal ($x$) direction is on the order of
$1/A$, and $1/B$ in the vertical, where, as in the one-dimensional case,
aperture is best measured in units of wavelength.

Suppose our aperture is circular, with radius $A$. Then we have Fourier
transform values $f(x,y)$ for $\sqrt{x^2+y^2} \le A$. Let $h(x,y)$ equal
one, for $\sqrt{x^2+y^2} \le A$, and zero, otherwise. Then the
point-spread function of this limited-aperture system is the Fourier
transform of $h(x,y)$, given by
$H(\alpha,\beta) = \frac{2\pi A}{r}J_1(rA)$, with
$r = \sqrt{\alpha^2+\beta^2}$. The resolution of this system is roughly
the distance from the origin to the first null of the function $J_1(rA)$,
which means that $rA = 4$, roughly.
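
The location of that first null can be checked numerically. The following sketch is an illustration only: it evaluates $J_1$ from Bessel's integral representation with a midpoint rule and finds the first positive null by bisection. The function names, the bracketing interval [3, 4.5], and the step counts are assumptions made for this example.

```python
import math

def bessel_j1(x, steps=2000):
    """J_1(x) from Bessel's integral (1/pi) * int_0^pi cos(tau - x sin(tau)) dtau,
    approximated with the midpoint rule."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h
        total += math.cos(tau - x * math.sin(tau))
    return total * h / math.pi

def first_null_of_j1(lo=3.0, hi=4.5, tol=1e-10):
    """Bisection for the first positive zero of J_1, assumed to lie in [lo, hi]."""
    f_lo = bessel_j1(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = bessel_j1(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

null = first_null_of_j1()
print(f"first null of J_1 at rA = {null:.4f}")
```

The computed null lies near $rA = 3.83$, in line with the rough value of 4 used above.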

For the solar emission problem, this says that we would need a circular 
aperture with radius approximately one kilometer to achieve 3 minutes of 
arc resolution. But this holds only if the antenna is stationary; a moving 
antenna is different! The solar emission problem was solved by using a 
rectangular antenna with a large A, but a small B, and exploiting the 
rotation of the earth. The resolution is then good in the horizontal, but bad 
in the vertical, so that the imaging system discriminates well between two
distinct vertical lines, but cannot resolve sources within the same vertical
line. Because B is small, what we end up with is essentially the integral 
of the function f(x, z) along each vertical line. By tilting the antenna, and 
waiting for the earth to rotate enough, we can get these integrals along any 
set of parallel lines. The problem then is to reconstruct $F(k_1,k_2)$ from such
line integrals. This is also the main problem in tomography. 


9.17 Broadband Signals 


We have spent considerable time discussing the case of a distant point 
source or an extended object transmitting or reflecting a single-frequency 
signal. If the signal consists of many frequencies, the so-called broadband 
case, we can still analyze the received signals at the sensors in terms of time 
delays, but we cannot easily convert the delays to phase differences, and 
thereby make good use of the Fourier transform. One approach is to filter 
each received signal, to remove components at all but a single frequency, 
and then to proceed as previously discussed. In this way we can process one 
frequency at a time. The object now is described in terms of a function of
both $k$ and $\omega$, with $F(k,\omega)$ the complex amplitude associated
with the wave vector $k$ and the frequency $\omega$. In the case of radar,
the function $F(k,\omega)$ tells us how the material at $P$ reflects the
radio waves at the various frequencies $\omega$, and thereby gives
information about the nature of the material making up the object near
the point $P$.

There are times, of course, when we do not want to decompose a broad- 
band signal into single-frequency components. A satellite reflecting a TV 
signal is a broadband point source. All we are interested in is receiving the 
broadband signal clearly, free of any other interfering sources. The direc- 
tion of the satellite is known and the antenna is turned to face the satellite. 
Each location on the parabolic dish reflects the same signal. Because of its 
parabolic shape, the signals reflected off the dish and picked up at the focal 
point have exactly the same travel time from the satellite, so they combine 
coherently, to give us the desired TV signal. 
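
The equal-travel-time property of the parabolic dish can be verified directly. The sketch below is an illustration under stated assumptions: the dish cross-section is taken to be the parabola $y = x^2/(4p)$ with focus $(0,p)$, the incoming signal is modeled as a flat wavefront at height $H$ travelling straight down, and the values of $p$ and $H$ are hypothetical.

```python
import math

def travel_distance(x, focal_length, wavefront_height):
    """Path length of a ray that leaves a flat wavefront at the given height,
    travels straight down to the dish y = x^2 / (4 p), and is then reflected
    to the focus (0, p)."""
    y = x * x / (4.0 * focal_length)
    down = wavefront_height - y
    to_focus = math.hypot(x, y - focal_length)
    return down + to_focus

p = 0.5    # hypothetical focal length
H = 10.0   # hypothetical height of the incoming plane wavefront
lengths = [travel_distance(x, p, H) for x in (-1.0, -0.3, 0.0, 0.4, 0.9)]
print(lengths)
```

Every entry equals $H + p$ up to rounding, whatever the point of reflection, which is why the reflected copies of the broadband signal add coherently at the focal point.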
The Phase Problem


10.1 Chapter Summary
10.2 Reconstructing from Over-Sampled Complex FT Data
10.3 The Phase Problem
10.4 A Phase-Retrieval Algorithm
10.5 Fienup's Method
10.6 Does the Iteration Converge?


10.1 Chapter Summary 


One of the main problems we consider in this book is the estimation 
of a function from finitely many values of its Fourier transform. In such 
cases, the data are complex numbers and the function to be estimated is 
a complex-valued function. As we mentioned previously, there are certain 
cases in which we have a phase problem, where it is not possible to measure 
the complex numbers, but only their magnitudes. Estimating the structure 
of a crystal from scattering data in x-ray crystallography and optical imag- 
ing through a turbulent atmosphere are two examples in which the phase 
problem arises. As you might imagine, reconstruction from magnitude-only 
data is more difficult than from the full complex data. 

In this chapter we describe an algorithm for solving the phase problem 
that is based on the MDFT estimator discussed previously. This algorithm 
was originally introduced in [30]. The reader is invited to consult [30] for 
additional details and examples. 
10.2 Reconstructing from Over-Sampled Complex FT Data


Let $f : [-\pi,\pi] \to \mathbb{C}$ have Fourier transform

$$F(\gamma) = \int_{-\pi}^{\pi} f(x)e^{i\gamma x}\,dx.$$

The Fourier series expansion for $f(x)$ is then

$$f(x) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} F(n)e^{-inx}.$$

If we are able to obtain only the values $F(n)$ for $|n| \le N$, then the
DFT estimate of $f(x)$ is

$$f_{DFT}(x) = \frac{1}{2\pi}\sum_{n=-N}^{N} F(n)e^{-inx},$$

for $|x| \le \pi$. We denote the data vector by
$d = (F(-N),\ldots,F(N))^T$.
We assume now that $f(x) = 0$ for $x$ outside the interval $V = [-v,v]$,
for some $v$ with $0 < v < \pi$.


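Ex. 10.1 Let $S$ be the $2N+1$ by $2N+1$ matrix with entries

$$S_{m,n} = \frac{\sin v(n-m)}{\pi(n-m)},$$

for $m \ne n$, and $S_{m,m} = \frac{v}{\pi}$. Show that

$$2\pi\int_{-v}^{v} |f_{DFT}(x)|^2\,dx = d^\dagger S d,$$

and

$$2\pi\int_{-\pi}^{\pi} |f_{DFT}(x)|^2\,dx = d^\dagger d.$$

Therefore, the amount of DFT energy outside the interval $[-v,v]$ is

$$\int_{-\pi}^{\pi} |f_{DFT}(x)|^2\,dx - \int_{-v}^{v} |f_{DFT}(x)|^2\,dx = \frac{1}{2\pi}\left(d^\dagger d - d^\dagger S d\right).$$

The proportion of DFT energy outside $V = [-v,v]$ is then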

$$\frac{d^\dagger d - d^\dagger S d}{d^\dagger d} = 1 - \frac{d^\dagger S d}{d^\dagger d} \ge 1 - \lambda_{\max}(S),$$

where $\lambda_{\max}(S)$ denotes the largest eigenvalue of the
positive-definite symmetric matrix $S$.

When $v$ is close to $\pi$ or $N$ is large, $\lambda_{\max}(S)$ is near
one. The trace of the matrix $S$ is $\mathrm{trace}(S) = (2N+1)\frac{v}{\pi}$,
and, as $N$ approaches $+\infty$, roughly $(2N+1)\frac{v}{\pi}$
eigenvalues of $S$ have values that are approximately equal to one and the
remainder have values approximately equal to zero. It is curious to note
that, for large values of $N$, a plot of the eigenvalues of $S$, in
descending order of size, resembles the graph of the right half of the
function $\chi_V(x)$. This is one case in which an eigenspectrum and a
power spectrum are related.
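
The trace formula and the size of the largest eigenvalue are easy to check for a particular case. The sketch below is an illustration only: it builds the matrix $S$ of Ex. 10.1 for the hypothetical choices $N = 10$ and $v = \pi/2$, and estimates $\lambda_{\max}(S)$ by power iteration, a method chosen here for simplicity and not one prescribed by the text.

```python
import math

def sinc_matrix(N, v):
    """The (2N+1) x (2N+1) matrix S with S[m][n] = sin(v (n - m)) / (pi (n - m))
    off the diagonal and v / pi on the diagonal."""
    size = 2 * N + 1
    S = [[0.0] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            k = n - m
            S[m][n] = v / math.pi if k == 0 else math.sin(v * k) / (math.pi * k)
    return S

def largest_eigenvalue(S, iterations=300):
    """Power iteration; returns the Rayleigh quotient of the final unit vector."""
    size = len(S)
    x = [1.0] * size
    for _ in range(iterations):
        y = [sum(S[i][j] * x[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    Sx = [sum(S[i][j] * x[j] for j in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(x, Sx))

N, v = 10, math.pi / 2          # hypothetical values for the illustration
S = sinc_matrix(N, v)
trace = sum(S[i][i] for i in range(2 * N + 1))
lam_max = largest_eigenvalue(S)
print(trace, (2 * N + 1) * v / math.pi, lam_max)
```

The first two printed numbers agree, and the third is close to one, so the lower bound $1 - \lambda_{\max}(S)$ on the proportion of energy outside $V$ is close to zero in this case.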

The lower bound on the proportion of energy outside V will be attained 
if d is replaced by an eigenvector of S with eigenvalue $\lambda_{\max}(S)$. Then the
DFT will be maximally concentrated within V, but will be quite smooth 
and have little structure. When the function f(x) has structure within the 
interval V that we wish to reconstruct, we will need to employ eigenvectors 
of S other than the ones associated with the largest eigenvalues of S. One 
way to do this is to use the MDFT estimator discussed previously. 

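The MDFT estimator of the function $f(x)$ is

$$f_{MDFT}(x) = \chi_V(x)\sum_{n=-N}^{N} b_n e^{-inx},$$

where the vector $b$ of coefficients is $b = \frac{1}{2\pi}S^{-1}d$. The
energy of the function $f_{MDFT}(x)$ is then

$$\int_{-v}^{v} |f_{MDFT}(x)|^2\,dx = 2\pi\, b^\dagger S b = \frac{1}{2\pi}\, d^\dagger S^{-1} d.$$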


Ex. 10.2 Show that

$$\frac{d^\dagger S^{-1} d}{d^\dagger d} \ge \frac{d^\dagger d}{d^\dagger S d}.$$

When the data are truly values of $F(n)$ and the function $f(x)$ has
reasonable values and is actually supported on the interval $V = [-v,v]$,
then the
energy of the MDFT estimator will not be abnormally large. However, if
the data values are not at least approximately equal to the values F (n), 
or f(x) is not supported on the interval V, then the MDFT energy will be 
quite large, indicating a mismatch between our data and our assumptions 
about f(x). This behavior can actually be put to good use. In some cases, 
we may not know V, but do not want to overestimate it; we want V to be 
as small as is allowable, but not smaller. We can take a decreasing sequence 
of intervals V and stop when we see the MDFT energy begin to explode. 
We can also use this behavior to solve the phase problem. 
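
The sensitivity of the MDFT energy to a mismatch between data and assumptions can be seen in a small example. The sketch below is an illustration only. It takes $f$ to be the indicator function of $V = [-v,v]$, so that $F(n) = 2\sin(nv)/n$ and $F(0) = 2v$, and compares the energy $\frac{1}{2\pi}d^\dagger S^{-1}d$ for the true data with the energy obtained when the phase of $F(0)$ alone is changed by $\pi$. The choices $N = 3$ and $v = \pi/2$, the helper names, and the use of plain Gaussian elimination are assumptions made for this example.

```python
import math

def sinc_matrix(N, v):
    """The matrix S of Ex. 10.1."""
    size = 2 * N + 1
    return [[v / math.pi if n == m else math.sin(v * (n - m)) / (math.pi * (n - m))
             for n in range(size)] for m in range(size)]

def solve(A, b):
    """Gaussian elimination with partial pivoting; b may be complex."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def mdft_energy(d, S):
    """The MDFT energy (1 / (2 pi)) d^dagger S^{-1} d."""
    w = solve(S, d)
    return sum(di.conjugate() * wi for di, wi in zip(d, w)).real / (2 * math.pi)

N, v = 3, math.pi / 2
S = sinc_matrix(N, v)
d_true = [complex(2 * v if n == 0 else 2 * math.sin(n * v) / n)
          for n in range(-N, N + 1)]
d_wrong = list(d_true)
d_wrong[N] = -d_wrong[N]        # same magnitudes, but the phase of F(0) is off by pi

E_true = mdft_energy(d_true, S)
E_wrong = mdft_energy(d_wrong, S)
print(E_true, E_wrong)
```

For the true data the MDFT estimator reproduces $f$ itself, so its energy is $2v = \pi$; the single wrong phase raises the energy by more than an order of magnitude.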




10.3 The Phase Problem 


We suppose now that $F(n) = |F(n)|e^{i\phi(n)}$, and we have only the
magnitude data, $|F(n)|$, for $|n| \le N$. If we take arbitrary phase
angles $\theta(n)$ and create complex "data"

$$G(n) = |F(n)|e^{i\theta(n)},$$

for $|n| \le N$, we can then pretend to have the complex FT data for
$f(x)$ and compute the MDFT estimate. Fortunately for us, as the phase
angles $\theta(n)$ begin to differ substantially from the true phase
angles $\phi(n)$, the MDFT reacts to this mismatch and the MDFT energy
increases dramatically. The
idea is then to monitor the MDFT energy as we make choices of phase 
angles, attempting to find ones that are approximately correct. In the next 
section we present an iterative algorithm to implement this idea. 


10.4 A Phase-Retrieval Algorithm 

Let $\theta = (\theta(-N),\ldots,\theta(N))$ be an arbitrary selection of
phase angles and

$$d(\theta) = \left(|F(-N)|e^{i\theta(-N)},\ldots,|F(N)|e^{i\theta(N)}\right)^T$$

our constructed "data" vector having the true magnitudes, but arbitrary
phases. We shall also denote by $d(\theta)$ the infinite sequence whose
only nonzero entries are the entries of the finite vector $d(\theta)$; the
context will make clear which interpretation we are using. The energy in
the resulting MDFT estimator is

$$E(\theta) = \frac{1}{2\pi}\, d(\theta)^\dagger S^{-1} d(\theta).$$

Our objective is to find a choice of angles $\theta(n)$ for which
$E(\theta)$ is not unreasonably large, in the hope that the resulting MDFT
will be a decent approximation of the true $f(x)$.

One approach would be to design an iterative algorithm that takes us
from one phase vector $\theta^k$ to a new one, $\theta^{k+1}$, in such a
way that $E(\theta^k) > E(\theta^{k+1})$. Perhaps a gradient-descent
algorithm could be devised to do this.
Instead, we have an iterative algorithm that, at least in our simulations, 
achieves much the same result, by a more indirect approach. 

Let the Hilbert space $H$ be $L^2[-\pi,\pi]$ and $P_V$ the orthogonal
projection of $H$ onto the subspace $L^2[-v,v]$. For any infinite sequence
$G = \{G(n)\}$, denote by $\mathcal{F}^{-1}G$ the function

$$(\mathcal{F}^{-1}G)(x) = g(x) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} G(n)e^{-inx},$$

for $|x| \le \pi$. Then we write $G = \mathcal{F}g$. Define
$(AG)(n) = G(n)$, for $|n| \le N$, and $(AG)(n) = 0$, otherwise. Define
$(DG)(n) = 0$, if $G(n) = 0$, $(DG)(n) = G(n)$, for $|n| > N$, and

$$(DG)(n) = \frac{|F(n)|}{|G(n)|}\,G(n),$$

otherwise. Then $DA = AD$ as operators.

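We begin with an arbitrary phase vector $\theta^0$ and use it to define
$g^0 = P_V\mathcal{F}^{-1}d(\theta^0)$. We let $\mathcal{F}g^0 = G^0$.
Having found $g^k$ and $G^k = \mathcal{F}g^k$, we define $\theta^{k+1}$ by

$$d(\theta^{k+1}) = DAG^k.$$

The iterative step is then

$$g^{k+1} = P_V\mathcal{F}^{-1}\left((I-A)\mathcal{F}g^k + AD\mathcal{F}g^k\right).$$

We can also write

$$g^{k+1} = P_V\mathcal{F}^{-1}d(\theta^{k+1}).$$

Note that

$$g^{k+1} - g^k = P_V\mathcal{F}^{-1}(DAG^k - AG^k) = P_V\mathcal{F}^{-1}c^k,$$

for

$$c^k = DAG^k - AG^k.$$

Therefore,

$$g^{k+1} = g^0 + \sum_{m=0}^{k} P_V\mathcal{F}^{-1}c^m = P_V\mathcal{F}^{-1}a^{k+1} \qquad (10.1)$$

for

$$a^{k+1} = d(\theta^0) + \sum_{m=0}^{k} c^m.$$

It follows then that

$$a^{k+1} = a^k + DAG^k - AG^k.$$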


From Equation (10.1) we see that each function $g^k$ has the form of an
MDFT estimator associated with the subset $V$ and the "data" $AG^k$.
Therefore,

$$Sa^k = AG^k.$$

The iteration then becomes

$$a^{k+1} = a^k + DAG^k - AG^k = a^k + DSa^k - Sa^k.$$

We iterate using this updating step until convergence to some $a^\infty$
and then take

$$g^\infty = P_V\mathcal{F}^{-1}a^\infty$$

as our final estimate of $f(x)$. The energy at each step is, up to the
constant factor $\frac{1}{2\pi}$,

$$E(\theta^k) = d(\theta^k)^\dagger S^{-1} d(\theta^k),$$

so we can easily monitor the energy at each stage of the iteration.
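
The finite-dimensional form of the update, $a^{k+1} = a^k + DSa^k - Sa^k$, can be sketched in a few lines. The code below is an illustration only, not the implementation used in the simulations mentioned above. The test object is hypothetical: a coefficient vector `a_star` is chosen, so that $Sa^\star$ plays the role of the true Fourier data, and only the magnitudes of $Sa^\star$ are given to the iteration. The starting phases, the number of iterations, and the helper names are assumptions.

```python
import cmath
import math

def sinc_matrix(N, v):
    """The matrix S of Ex. 10.1."""
    size = 2 * N + 1
    return [[v / math.pi if n == m else math.sin(v * (n - m)) / (math.pi * (n - m))
             for n in range(size)] for m in range(size)]

def matvec(S, a):
    return [sum(S[i][j] * a[j] for j in range(len(a))) for i in range(len(a))]

def impose_magnitudes(G, magnitudes):
    """The operator D on the indices |n| <= N: keep each phase, reset each magnitude."""
    return [0j if g == 0 else mag * g / abs(g) for g, mag in zip(G, magnitudes)]

def phase_retrieval_step(a, S, magnitudes):
    """One update a <- a + D S a - S a."""
    G = matvec(S, a)                         # the "data" A G^k = S a^k
    DG = impose_magnitudes(G, magnitudes)
    return [ai + dg - g for ai, dg, g in zip(a, DG, G)]

N, v = 3, math.pi / 2
S = sinc_matrix(N, v)

# Hypothetical object: P_V F^{-1} a_star is supported on V, with Fourier data S a_star.
a_star = [complex(math.cos(n), 0.3 * n) for n in range(-N, N + 1)]
magnitudes = [abs(g) for g in matvec(S, a_star)]

# Start from a^0 = d(theta^0): the true magnitudes with arbitrary phases.
a = [mag * cmath.exp(1j * 0.7 * n * n)
     for n, mag in zip(range(-N, N + 1), magnitudes)]
for k in range(50):
    a = phase_retrieval_step(a, S, magnitudes)

# A coefficient vector that is consistent with the magnitude data is left unchanged.
fixed = phase_retrieval_step(a_star, S, magnitudes)
drift = max(abs(x - y) for x, y in zip(fixed, a_star))
print(drift)
```

The last lines confirm that a vector already consistent with the magnitude data is a fixed point of the update; whether the iteration reaches such a point from a given start is the convergence question, which this sketch does not settle.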


10.5 Fienup’s Method 


Our algorithm has the iterative step

$$g^{k+1} = P_V\mathcal{F}^{-1}\left((I-A)\mathcal{F}g^k + AD\mathcal{F}g^k\right),$$

where the operators $\mathcal{F}$ and $\mathcal{F}^{-1}$ relate infinite
sequences to functions of a continuous variable. If we choose, instead, to
view $g^k$ as a finite vector and these operators as relating finite
vectors to one another via the FFT, we get Fienup's error-reduction method
[76, 77]. In the error-reduction method what we call here the function
$g^k(x)$, defined for $x$ in the interval $[-\pi,\pi]$, is discretized and
replaced by a vector in $\mathbb{C}^J$, where $J > 2N+1$. Similarly, the
infinite sequence $\mathcal{F}g^k$ is replaced by a vector in
$\mathbb{C}^J$ and the operator $\mathcal{F}$ is replaced by the FFT.


10.6 Does the Iteration Converge? 


The operator $P_V$ is an orthogonal projection onto a subspace of $H$, and
the operator

$$P = \mathcal{F}^{-1}(I-A)\mathcal{F} + \mathcal{F}^{-1}DA\mathcal{F}$$

is also a projection, but its range is not a convex set; therefore, the
useful convergence theorems about composition of orthogonal projections
onto convex sets do not apply here. All is not lost, however.

In [108] Levi and Stark define the set-distance error

$$J(g) = \|P_1 g - g\|_2 + \|P_2 g - g\|_2,$$

for projections $P_1$ and $P_2$, when one of the projections has nonconvex
range. They show that, for the sequence generated by the iterative step
$g^{k+1} = P_1P_2g^k$,

$$J(g^{k+1}) \le J(P_2g^k) \le J(g^k).$$

In our case, with $P_1 = P_V$ and $P_2 = P$, we find that $g^{k+1}$ is at
least as close to being consistent with the magnitude data as $g^k$ is,
and $Pg^{k+1}$ is at least as close to being supported on $V$ as $Pg^k$
is.
11.1 Chapter Summary


Our topic is now transmission tomography. This chapter will provide a 
detailed description of how the data is gathered, the mathematical model of 
the scanning process, the problem to be solved, the various mathematical 
techniques needed to solve this problem, and the manner in which these 
techniques are applied, including filtering methods for inverting the two- 
dimensional Fourier transform. 

According to the Central Slice Theorem, if we have all the line integrals 
through the attenuation function f(x,y) then we have the two-dimensional 
Fourier transform of f(x,y). To get f(x,y) we need to invert the two- 
dimensional Fourier transform. 
11.2 X-ray Transmission Tomography


Although transmission tomography is not limited to scanning living be- 
ings, we shall concentrate here on the use of x-ray tomography in medical 
diagnosis and the issues that concern us in that application. The mathe- 
matical formulation will, of course, apply more generally. 

In x-ray tomography, x-rays are transmitted through the body along 
many lines. In some, but not all, cases, the lines will all lie in the same plane. 
The strength of the x-rays upon entering the body is assumed known, and 
the strength upon leaving the body is measured. This data can then be used 
to estimate the amount of attenuation the x-ray encountered along that 
line, which is taken to be the integral, along that line, of the attenuation 
function. On the basis of these line integrals, we estimate the attenuation 
function. This estimate is presented to the physician as one or more two- 
dimensional images. 


11.3 The Exponential-Decay Model 


As an x-ray beam passes through the body, it encounters various types 
of matter, such as soft tissue, bone, ligaments, air, each weakening the 
beam to a greater or lesser extent. If the intensity of the beam upon entry 
is $I_{in}$ and $I_{out}$ is its lower intensity after passing through the
body, then

$$I_{out} = I_{in}\,e^{-\int_L f},$$

where $f = f(x,y) \ge 0$ is the attenuation function describing the
two-dimensional distribution of matter within the slice of the body being
scanned and $\int_L f$ is the integral of the function $f$ over the line
$L$ along which the x-ray beam has passed. To see why this is the case,
imagine the line $L$ parameterized by the variable $s$ and consider the
intensity function $I(s)$ as a function of $s$. For small $\Delta s > 0$,
the drop in intensity from the start to the end of the interval
$[s, s+\Delta s]$ is approximately proportional to the intensity $I(s)$,
to the attenuation $f(s)$ and to $\Delta s$, the length of the interval;
that is,

$$I(s) - I(s+\Delta s) \approx f(s)I(s)\Delta s.$$

Dividing by $\Delta s$ and letting $\Delta s$ approach zero, we get

$$I'(s) = -f(s)I(s).$$

Ex. 11.1 Show that the solution to this differential equation is

$$I(s) = I(0)\exp\left(-\int_{u=0}^{u=s} f(u)\,du\right).$$

Hint: Use an integrating factor.


From knowledge of $I_{in}$ and $I_{out}$, we can determine $\int_L f$. If
we know $\int_L f$ for every line in the $x,y$-plane we can reconstruct
the attenuation function $f$. In the real world we know line integrals
only approximately and
only for finitely many lines. The goal in x-ray transmission tomography 
is to estimate the attenuation function f(x,y) in the slice, from finitely 
many noisy measurements of the line integrals. We usually have prior in- 
formation about the values that f(x,y) can take on. We also expect to find 
sharp boundaries separating regions where the function f(x,y) varies only 
slightly. Therefore, we need algorithms capable of providing such images. 
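
The relation between the measured intensities and the line integral can be illustrated numerically. The sketch below is an illustration only: it marches the approximation $I(s+\Delta s) \approx I(s) - f(s)I(s)\Delta s$ along a line of unit length and then recovers the line integral as $-\log(I_{out}/I_{in})$. The attenuation profile, the entry intensity, and the step count are hypothetical.

```python
import math

def attenuation(s):
    """Hypothetical attenuation profile along the line, for 0 <= s <= 1."""
    return 0.2 + 1.5 * math.exp(-((s - 0.5) / 0.1) ** 2)

def transmit(intensity_in, length=1.0, steps=100000):
    """March I(s + ds) = I(s) - f(s) I(s) ds along the line. Also returns the
    midpoint-rule value of the line integral of f."""
    ds = length / steps
    intensity = intensity_in
    line_integral = 0.0
    for k in range(steps):
        f_mid = attenuation((k + 0.5) * ds)
        intensity -= f_mid * intensity * ds
        line_integral += f_mid * ds
    return intensity, line_integral

I_in = 1000.0
I_out, integral = transmit(I_in)
recovered = -math.log(I_out / I_in)
print(integral, recovered)
```

The two printed values agree to several digits, which is the content of the exponential-decay model: the logarithm of the intensity ratio is the line integral of $f$.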


11.4 Difficulties to Be Overcome 


There are several problems associated with this model. The paths taken 
by x-ray beams are not exactly straight lines; the beams tend to spread 
out. The x-rays are not monochromatic, and their various frequency com- 
ponents are attenuated at different rates, resulting in beam hardening, that 
is, changes in the spectrum of the beam as it passes through the object. 
The beams consist of photons obeying statistical laws, so our algorithms 
probably should be based on these laws. How we choose the line segments is 
determined by the nature of the problem; in certain cases we are somewhat 
limited in our choice of these segments. Patients move; they breathe, their 
hearts beat, and, occasionally, they shift position during the scan. Com- 
pensating for these motions is an important, and difficult, aspect of the 
image reconstruction process. Finally, to be practical in a clinical setting, 
the processing that leads to the reconstructed image must be completed in 
a short time, usually around fifteen minutes. This time constraint is what 
motivates viewing the three-dimensional attenuation function in terms of 
its two-dimensional slices. 

As we shall see, the Fourier transform and the associated theory of con- 
volution filters play important roles in the reconstruction of transmission 
tomographic images. 

The data we actually obtain at the detectors are counts of detected 
photons. These counts are not the line integrals; they are random quan- 
tities whose means, or expected values, are related to the line integrals. 
The Fourier inversion methods for solving the problem ignore its statistical 
aspects; in contrast, other methods, such as likelihood maximization, are
based on a statistical model that involves Poisson-distributed emissions. 


11.5 Reconstruction from Line Integrals 


We turn now to the underlying problem of reconstructing attenuation 
functions from line-integral data. 


11.5.1 The Radon Transform 


Our goal is to reconstruct the function $f(x,y) \ge 0$ from line-integral
data. Let $\theta$ be a fixed angle in the interval $[0,\pi)$. Form the
$t,s$-axis system with the positive $t$-axis making the angle $\theta$
with the positive $x$-axis, as shown in Figure 11.1. Each point $(x,y)$ in
the original coordinate system has coordinates $(t,s)$ in the second
system, where the $t$ and $s$ are given by

$$t = x\cos\theta + y\sin\theta,$$

and

$$s = -x\sin\theta + y\cos\theta.$$

If we have the new coordinates $(t,s)$ of a point, the old coordinates are
$(x,y)$ given by

$$x = t\cos\theta - s\sin\theta,$$

and

$$y = t\sin\theta + s\cos\theta.$$

We can then write the function $f$ as a function of the variables $t$ and
$s$. For each fixed value of $t$, we compute the integral

$$\int_L f(x,y)\,ds = \int f(t\cos\theta - s\sin\theta,\; t\sin\theta + s\cos\theta)\,ds$$

along the single line $L$ corresponding to the fixed values of $\theta$
and $t$. We repeat this process for every value of $t$ and then change the
angle $\theta$ and repeat again. In this way we obtain the integrals of
$f$ over every line $L$ in the plane. We denote by $r_f(\theta,t)$ the
integral

$$r_f(\theta,t) = \int_L f(x,y)\,ds = \int f(t\cos\theta - s\sin\theta,\; t\sin\theta + s\cos\theta)\,ds.$$

The function $r_f(\theta,t)$ is called the Radon transform of $f$.
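
The definition translates directly into a numerical recipe: fix $\theta$ and $t$, and integrate $f$ over $s$ along the rotated line. The sketch below is an illustration only, using a midpoint rule. The test object (a disk of radius 0.5 centered at (0.3, -0.2)), the integration limits, and the step count are hypothetical choices; for a disk the exact line integral is simply the chord length.

```python
import math

def radon(f, theta, t, s_max=3.0, steps=6000):
    """Midpoint-rule approximation of r_f(theta, t): the integral of f over s along
    the line x = t cos(theta) - s sin(theta), y = t sin(theta) + s cos(theta)."""
    c, sn = math.cos(theta), math.sin(theta)
    ds = 2.0 * s_max / steps
    total = 0.0
    for k in range(steps):
        s = -s_max + (k + 0.5) * ds
        total += f(t * c - s * sn, t * sn + s * c)
    return total * ds

# Hypothetical test object: indicator function of a disk.
CX, CY, R = 0.3, -0.2, 0.5

def disk(x, y):
    return 1.0 if (x - CX) ** 2 + (y - CY) ** 2 <= R * R else 0.0

theta, t = math.pi / 3, 0.1
t0 = CX * math.cos(theta) + CY * math.sin(theta)   # t-coordinate of the disk center
exact = 2.0 * math.sqrt(R * R - (t - t0) ** 2)       # chord length
approx = radon(disk, theta, t)
print(approx, exact)
```

Because the integrand jumps at the edge of the disk, the midpoint rule is only accurate to about one step length here; smoother objects would give faster convergence.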


FIGURE 11.1: The Radon transform of $f$ at $(t,\theta)$ is the line
integral of $f$ along line $L$.


11.5.2 The Central Slice Theorem 


For fixed $\theta$ the function $r_f(\theta,t)$ is a function of the
single real variable $t$; let $R_f(\theta,\omega)$ be its Fourier
transform. Then

$$\begin{aligned}
R_f(\theta,\omega) &= \int r_f(\theta,t)e^{i\omega t}\,dt \\
&= \int\!\!\int f(t\cos\theta - s\sin\theta,\; t\sin\theta + s\cos\theta)\,e^{i\omega t}\,ds\,dt \\
&= \int\!\!\int f(x,y)\,e^{i\omega(x\cos\theta + y\sin\theta)}\,dx\,dy \\
&= F(\omega\cos\theta,\omega\sin\theta),
\end{aligned}$$

where $F(\omega\cos\theta,\omega\sin\theta)$ is the two-dimensional
Fourier transform of the function $f(x,y)$, evaluated at the point
$(\omega\cos\theta,\omega\sin\theta)$; this relationship is called the
Central Slice Theorem. For fixed $\theta$, as we change the value of
$\omega$, we obtain the values of the function $F$ along the points of the
line making the angle $\theta$ with the horizontal axis. As $\theta$
varies in $[0,\pi)$, we get all the values of the function $F$. Once we
have $F$, we can obtain $f$ using the formula for the two-dimensional
inverse Fourier transform. We conclude that we are able to determine $f$
from its line integrals. As we shall see, inverting the Fourier transform
can be implemented by combinations of frequency-domain filtering and back
projection.
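
The Central Slice Theorem can be tested on an object whose two-dimensional Fourier transform is known in closed form. The sketch below is an illustration only. It uses the indicator function of the rectangle $|x| \le A$, $|y| \le B$, whose transform is $4AB\,\mathrm{sinc}(Au)\,\mathrm{sinc}(Bv)$ with $\mathrm{sinc}(x) = \sin(x)/x$, computes the line integrals and their one-dimensional transform by midpoint sums, and compares the result with the closed form at $(\omega\cos\theta,\omega\sin\theta)$. The values of $A$, $B$, $\theta$, $\omega$ and the grid sizes are hypothetical.

```python
import cmath
import math

A, B = 1.0, 0.5   # hypothetical half-widths of the rectangular object

def rect(x, y):
    return 1.0 if abs(x) <= A and abs(y) <= B else 0.0

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(x) / x

def slice_from_projection(theta, omega, half_width=1.5, steps=600):
    """R_f(theta, omega): line integrals over s, then the Fourier transform over t,
    both approximated by midpoint sums on [-half_width, half_width]."""
    c, sn = math.cos(theta), math.sin(theta)
    h = 2.0 * half_width / steps
    total = 0j
    for i in range(steps):
        t = -half_width + (i + 0.5) * h
        line_integral = 0.0
        for j in range(steps):
            s = -half_width + (j + 0.5) * h
            line_integral += rect(t * c - s * sn, t * sn + s * c)
        total += line_integral * h * cmath.exp(1j * omega * t) * h
    return total

theta, omega = math.pi / 6, 2.0
lhs = slice_from_projection(theta, omega)
rhs = 4 * A * B * sinc(A * omega * math.cos(theta)) * sinc(B * omega * math.sin(theta))
print(lhs, rhs)
```

The two values agree up to the discretization error of the sums, and the imaginary part of the first is negligible because the rectangle is symmetric about the origin.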
11.6 Inverting the Fourier Transform


The Fourier-transform inversion formula for two-dimensional functions
tells us that the function $f(x,y)$ can be obtained as

$$f(x,y) = \frac{1}{4\pi^2}\int\!\!\int F(u,v)\,e^{-i(xu+yv)}\,du\,dv. \qquad (11.1)$$


We now derive alternative inversion formulas. 


11.6.1 Back Projection 


For $0 \le \theta < \pi$ and all real $t$, let $h(\theta,t)$ be any
function of the variables $\theta$ and $t$; for example, it could be the
Radon transform. As with the Radon transform, we imagine that each pair
$(\theta,t)$ corresponds to one line through the $x,y$-plane. For each
fixed point $(x,y)$ we assign to this point the average, over all
$\theta$, of the quantities $h(\theta,t)$ for every pair $(\theta,t)$ such
that the point $(x,y)$ lies on the associated line. The summing process is
integration and the back-projection function at $(x,y)$ is

$$BP_h(x,y) = \int_0^{\pi} h(\theta, x\cos\theta + y\sin\theta)\,d\theta.$$


The operation of back projection will play an important role in what follows 
in this chapter. 


11.6.2 Ramp Filter, then Back Project 


Expressing the double integral in Equation (11.1) in polar coordinates
$(\omega,\theta)$, with $\omega \ge 0$, $u = \omega\cos\theta$, and
$v = \omega\sin\theta$, we get

$$f(x,y) = \frac{1}{4\pi^2}\int_0^{2\pi}\!\!\int_0^{\infty} F(u,v)\,e^{-i(xu+yv)}\,\omega\,d\omega\,d\theta,$$

or

$$f(x,y) = \frac{1}{4\pi^2}\int_0^{\pi}\!\!\int_{-\infty}^{\infty} F(u,v)\,e^{-i(xu+yv)}\,|\omega|\,d\omega\,d\theta.$$

Now write

$$F(u,v) = F(\omega\cos\theta,\omega\sin\theta) = R_f(\theta,\omega),$$

where $R_f(\theta,\omega)$ is the FT with respect to $t$ of
$r_f(\theta,t)$, so that

$$\int_{-\infty}^{\infty} F(u,v)\,e^{-i(xu+yv)}\,|\omega|\,d\omega = \int_{-\infty}^{\infty} R_f(\theta,\omega)\,|\omega|\,e^{-i\omega t}\,d\omega.$$

The function $g_f(\theta,t)$ defined for $t = x\cos\theta + y\sin\theta$
by

$$g_f(\theta, x\cos\theta + y\sin\theta) = \frac{1}{2\pi}\int_{-\infty}^{\infty} R_f(\theta,\omega)\,|\omega|\,e^{-i\omega t}\,d\omega \qquad (11.2)$$

is the result of a linear filtering of $r_f(\theta,t)$ using a ramp filter
with transfer function $H(\omega) = |\omega|$. Then,

$$f(x,y) = \frac{1}{2\pi}BP_{g_f}(x,y) = \frac{1}{2\pi}\int_0^{\pi} g_f(\theta, x\cos\theta + y\sin\theta)\,d\theta$$

gives $f(x,y)$ as the result of a back-projection operator; for every
fixed value of $(\theta,t)$ add $g_f(\theta,t)$ to the current value at
the point $(x,y)$ for all $(x,y)$ lying on the straight line determined by
$\theta$ and $t$ by $t = x\cos\theta + y\sin\theta$. The final value at a
fixed point $(x,y)$ is then the average of all the values $g_f(\theta,t)$
for those $(\theta,t)$ for which $(x,y)$ is on the line
$t = x\cos\theta + y\sin\theta$. It is therefore said that $f(x,y)$ can be
obtained by filtered back-projection (FBP) of the line-integral data.

Knowing that f(x,y) is related to the complete set of line integrals by 
filtered back-projection suggests that, when only finitely many line integrals 
are available, a similar ramp filtering and back-projection can be used to 
estimate f(x,y); in the clinic this is the most widely used method for the 
reconstruction of tomographic images. 


11.6.3 Back Project, then Ramp Filter 


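There is a second way to recover $f(x,y)$ using back projection and
filtering, this time in the reverse order; that is, we back project the
Radon transform and then ramp filter the resulting function of two
variables. We begin with the back-projection operation, as applied to the
function $h(\theta,t) = r_f(\theta,t)$.

We have

$$BP_{r_f}(x,y) = \int_0^{\pi} r_f(\theta, x\cos\theta + y\sin\theta)\,d\theta.$$

Replacing $r_f(\theta,t)$ with

$$r_f(\theta,t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} R_f(\theta,\omega)\,e^{-i\omega t}\,d\omega,$$

and inserting

$$R_f(\theta,\omega) = F(\omega\cos\theta,\omega\sin\theta),$$

and

$$t = x\cos\theta + y\sin\theta,$$

we get

$$BP_{r_f}(x,y) = \int_0^{\pi}\left(\frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega\cos\theta,\omega\sin\theta)\,e^{-i(x\cos\theta + y\sin\theta)\omega}\,d\omega\right)d\theta.$$

With $u = \omega\cos\theta$ and $v = \omega\sin\theta$, this becomes

$$\begin{aligned}
BP_{r_f}(x,y) &= \int_0^{\pi}\left(\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{F(u,v)}{\sqrt{u^2+v^2}}\,e^{-i(xu+yv)}\,|\omega|\,d\omega\right)d\theta \\
&= \int_0^{\pi}\left(\frac{1}{4\pi^2}\int_{-\infty}^{\infty} G(u,v)\,e^{-i(xu+yv)}\,|\omega|\,d\omega\right)d\theta \\
&= \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} G(u,v)\,e^{-i(xu+yv)}\,du\,dv.
\end{aligned}$$

Comparing with Equation (11.1), this tells us that the back projection of
$r_f(\theta,t)$ is the function $g(x,y)$ whose two-dimensional Fourier
transform is

$$G(u,v) = 2\pi\,F(u,v)/\sqrt{u^2+v^2}.$$

Therefore, we can obtain $f(x,y)$ from $r_f(\theta,t)$ by first back
projecting $r_f(\theta,t)$ to get $g(x,y)$ and then filtering $g(x,y)$ by
forming $G(u,v)$, multiplying by $\sqrt{u^2+v^2}/(2\pi)$, and taking the
inverse Fourier transform.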


11.6.4 Radon’s Inversion Formula 


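To get Radon's inversion formula, we need two basic properties of the
Fourier transform. First, if $f(x)$ has Fourier transform $F(\gamma)$ then
the derivative $f'(x)$ has Fourier transform $-i\gamma F(\gamma)$. Second,
if $F(\gamma) = \mathrm{sgn}(\gamma)$, the function that is
$\frac{\gamma}{|\gamma|}$ for $\gamma \ne 0$, and equal to zero for
$\gamma = 0$, then its inverse Fourier transform is
$f(x) = \frac{1}{i\pi x}$.

Writing Equation (11.2) as

$$g_f(\theta,t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \omega R_f(\theta,\omega)\,\mathrm{sgn}(\omega)\,e^{-i\omega t}\,d\omega,$$

we see that $g_f$ is the inverse Fourier transform of the product of the
two functions $\omega R_f(\theta,\omega)$ and $\mathrm{sgn}(\omega)$.
Consequently, $g_f$ is the convolution of their individual inverse Fourier
transforms, $i\frac{\partial}{\partial t}r_f(\theta,t)$ and
$\frac{1}{i\pi t}$; that is,

$$g_f(\theta,t) = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{\partial}{\partial t}r_f(\theta,s)\,\frac{1}{t-s}\,ds,$$

which is the Hilbert transform of the function
$\frac{\partial}{\partial t}r_f(\theta,t)$, with respect to the variable
$t$. Radon's inversion formula is then

$$f(x,y) = \frac{1}{2\pi}\int_0^{\pi} HT\!\left(\frac{\partial}{\partial t}r_f(\theta,t)\right)d\theta.$$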
11.7 From Theory to Practice


What we have just described is the theory. What happens in practice? 


11.7.1 The Practical Problems 


Of course, in reality we never have the Radon transform $r_f(\theta,t)$
for all values of its variables. Only finitely many angles $\theta$ are
used, and, for each $\theta$, we will have (approximate) values of line
integrals for only finitely many $t$. Therefore, taking the Fourier
transform of $r_f(\theta,t)$, as a function of the single variable $t$, is
not something we can actually do. At best, we can approximate
$R_f(\theta,\omega)$ for finitely many $\theta$. From the Central Slice
Theorem, we can then say that we have approximate values of
$F(\omega\cos\theta,\omega\sin\theta)$, for finitely many $\theta$. This
means that we have (approximate) Fourier transform values for $f(x,y)$
along finitely many lines through the origin, like the spokes of a wheel.
The farther from the origin we get, the fewer values we have, so the
coverage in Fourier space is quite uneven. The low spatial frequencies are
much better estimated than higher ones, meaning that we have a low-pass
version of the desired $f(x,y)$. The filtered-back-projection approaches
we have just discussed both involve ramp filtering, in which the higher
frequencies are increased, relative to the lower ones. This too can only
be implemented approximately, since the data is noisy and careless ramp
filtering will cause the reconstructed image to be unacceptably noisy.


11.7.2 A Practical Solution: Filtered Back Projection 


We assume, to begin with, that we have finitely many line integrals, that
is, we have values $r_f(\theta,t)$ for finitely many $\theta$ and finitely
many $t$. For each fixed $\theta$ we estimate the Fourier transform,
$R_f(\theta,\omega)$. This step can be performed in various ways, and we
can freely choose the values of $\omega$ at which we perform the
estimation. The FFT will almost certainly be involved in calculating the
estimates of $R_f(\theta,\omega)$.

For each fixed $\theta$ we multiply our estimated values of
$R_f(\theta,\omega)$ by $|\omega|$ and then use the FFT again to inverse
Fourier transform, to achieve a ramp filtering of $r_f(\theta,t)$ as a
function of $t$. Note, however, that when $|\omega|$ is large, we may
multiply by a smaller quantity, to avoid enhancing noise. We do this for
each angle $\theta$, to get a function of $(\theta,t)$, which we then back
project to get our final image. This is ramp filtering, followed by back
projection, as applied to the finite data we have.
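
A small sketch may help fix the two steps. It is an illustration only and departs from the description above in one respect, which should be kept in mind: instead of multiplying FFT values by $|\omega|$, it filters each projection by direct convolution with a band-limited ramp kernel (often called the Ram-Lak kernel), which plays the same role for sampled data. With the normalization used here, the filtered values approximate $g_f(\theta,t)/(2\pi)$, so that summing over the angles with weight $\pi$ divided by the number of angles approximates $\frac{1}{2\pi}\int_0^{\pi} g_f\,d\theta$. The phantom (the unit disk with attenuation one), the sample spacing, and the number of angles are hypothetical.

```python
import math

def ramp_filter(projection, tau):
    """Convolve one sampled projection with the band-limited ramp kernel
    h[0] = 1 / (4 tau^2), h[n] = -1 / (pi n tau)^2 for odd n, 0 for even n != 0."""
    n = len(projection)
    out = [0.0] * n
    for i in range(n):
        acc = projection[i] / (4.0 * tau * tau)
        for j in range(n):
            k = i - j
            if k % 2 != 0:
                acc -= projection[j] / (math.pi * k * tau) ** 2
        out[i] = acc * tau
    return out

def back_project(filtered, thetas, t_min, tau, x, y):
    """Sum the filtered projections over the angles at t = x cos(theta) + y sin(theta),
    using linear interpolation between samples."""
    total = 0.0
    for q, theta in zip(filtered, thetas):
        pos = (x * math.cos(theta) + y * math.sin(theta) - t_min) / tau
        i = int(math.floor(pos))
        if 0 <= i < len(q) - 1:
            frac = pos - i
            total += (1.0 - frac) * q[i] + frac * q[i + 1]
    return total * math.pi / len(thetas)

# Hypothetical phantom: the unit disk, whose line integrals are the chord
# lengths 2 sqrt(1 - t^2) for every angle.
tau, t_min = 0.02, -1.5
ts = [t_min + i * tau for i in range(151)]
projection = [2.0 * math.sqrt(max(0.0, 1.0 - t * t)) for t in ts]
thetas = [math.pi * k / 36 for k in range(36)]
filtered = [ramp_filter(projection, tau) for _ in thetas]

center = back_project(filtered, thetas, t_min, tau, 0.0, 0.0)
inside = back_project(filtered, thetas, t_min, tau, 0.3, 0.2)
print(center, inside)
```

Both reconstructed values should be close to one, the attenuation inside the disk. Replacing the kernel by one that rolls off at high frequencies is the spatial-domain counterpart of multiplying by a smaller quantity when $|\omega|$ is large.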

It is also possible to mimic the second approach to inversion, that is, to
back project onto the pixels each $r_f(\theta,t)$ that we have, and then
to perform a ramp filtering of this two-dimensional array of numbers to
obtain the final image. In this case, the two-dimensional ramp filtering
involves many applications of the FFT.

There is a third approach. Invoking the Central Slice Theorem, we can 
say that we have finitely many approximate values of $F(u,v)$, the Fourier
transform of the attenuation function f(x,y), along finitely many lines 
through the origin. The first step is to use these values to estimate the 
values of F(u,v) at the points of a rectangular grid. This step involves 
interpolation [157]. Once we have (approximate) values of F(u,v) on a 
rectangular grid, we perform a two-dimensional FFT to obtain our final 
estimate of the (discretized) f(x, y). 


11.8 Some Practical Concerns 


As computer power increases and scanners become more sophisticated, 
there is pressure to include more dimensionality in the scans. This means 
going beyond slice-by-slice tomography to fully three-dimensional images, 
or even including time as the fourth dimension, to image dynamically. This 
increase in dimensionality comes at a cost, however. Besides the increase in 
radiation to the patient, there are other drawbacks, such as longer acquisi- 
tion time, storing large amounts of data, processing and analyzing this data, 
displaying the results, reading and understanding the higher-dimensional 
images, and so on. 


11.9 Summary 


We have seen how the problem of reconstructing a function from line in- 
tegrals arises in transmission tomography. The Central Slice Theorem con- 
nects the line integrals and the Radon transform to the Fourier transform 
of the desired attenuation function. Various approaches to implementing 
the Fourier Inversion Formula lead to filtered-back-projection algorithms 
for the reconstruction. In x-ray tomography, as well as in PET, viewing the 
data as line integrals ignores the statistical aspects of the problem, and in 
SPECT, it ignores, as well, the important physical effects of attenuation. To 
incorporate more of the physics of the problem, iterative algorithms based 
on statistical models have been developed. We consider some of these al- 
gorithms in the books [41] and [42]. 
12.1 Chapter Summary


When we sample a function $f(x)$ we usually make some error, and the data
we get is not precisely $f(n\Delta)$, but contains additive noise, that
is, our data value is really $f(n\Delta)$ + noise. Noise is best viewed as
random, so it becomes necessary to treat random sequences $f = \{f_n\}$ in
which each $f_n$ is a random variable. The random variables $f_n$ and
$f_m$ may or may not be statistically independent. In this chapter we
survey several topics from probability and stochastic processes that are
particularly important in signal processing.


12.2 What Is a Random Variable? 


The simplest answer to the question “What is a random variable?” is 
“A random variable is a mathematical model”. Imagine that we repeatedly 
drop a baseball from eye-level to the floor. Each time, the baseball behaves 
the same. If we were asked to describe this behavior with a mathemati- 
cal model, we probably would choose to use a differential equation as our 
model. Ignoring everything except the force of gravity, we would write 


$$h''(t) = -32$$


as the equation describing the downward acceleration due to gravity.
Integrating, we have

$$h'(t) = -32t + h'(0)$$


as the velocity of the baseball at time $t > 0$, and integrating once more,

$$h(t) = -16t^2 + h'(0)t + h(0)$$


as the equation of position of the baseball at time t > 0, up to the moment 
when it hits the floor. Knowing h(0), the distance from eye-level to the floor, 
and knowing that, since we dropped the ball, h’(0) = 0, we can determine 
how long it will take the baseball to hit the floor, and the speed with which 
it will hit. This analysis will apply every time we drop the baseball. There 
will, of course, be slight differences from one drop to the next, depending, 
perhaps, on how the ball was held, but these will be so small as to be 
insignificant. 
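
As a small numerical sketch of this deterministic model, the following computes the time to reach the floor and the impact speed from $h(t) = -16t^2 + h'(0)t + h(0)$ with $h'(0) = 0$. The function name `drop_time_and_speed` and the eye level of 5.5 feet are assumptions made for illustration only.

```python
import math

def drop_time_and_speed(h0_feet):
    """Return (seconds to reach the floor, impact speed in ft/s) for a ball
    dropped from height h0_feet, using h(t) = -16 t^2 + h0 (so h'(0) = 0)."""
    t_hit = math.sqrt(h0_feet / 16.0)   # positive root of -16 t^2 + h0 = 0
    speed = 32.0 * t_hit                # |h'(t)| = 32 t at the moment of impact
    return t_hit, speed

t_hit, speed = drop_time_and_speed(5.5)  # hypothetical eye level of 5.5 feet
print(f"time to floor: {t_hit:.3f} s, impact speed: {speed:.2f} ft/s")
```

The same two numbers come out on every call, which is the sense in which this model treats each drop of the baseball as identical.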

Now imagine that, instead of a baseball, we drop a feather. A few rep- 
etitions are all that is necessary to convince us that the model used for the 
baseball no longer suffices. The factors that we safely ignored with regard 
to the baseball, such as air resistance, air currents, and how the object was 
held, now become important. The feather does not always land in the same 
place, it doesn’t always take the same amount of time to reach the floor, 
and doesn’t always land with the same velocity. It doesn’t even fall in a 
straight vertical line. How can we possibly model such behavior? Must we 
try to describe accurately the air resistance encountered by the feather? 
The answer is that we use random variables as our model.

While we cannot predict exactly the place where the feather will land,
and, of course, we must be careful to specify how we are to determine 
“the place” where it does land, we can learn, from a number of trials, 
where it tends to land, and we can postulate the probability that it will 
land within any given region of the floor. In this way, the place where the 
feather will land becomes a random variable with associated probability 
density function. Similarly, we can postulate the probability that the time 
for the fall will lie within any interval of elapsed time, making the elapsed 
time a random variable. Finally, we can postulate the probability that its 
velocity vector upon hitting the ground will lie within any given set of 
three-dimensional vectors, making the velocity a random vector. On the 
basis of these probabilistic models we can proceed to predict the outcome 
of the next drop. 

It is important to remember that the random variable is the model that 
we set up prior to the dropping of the feather, not the outcome of any 
particular drop. 


12.3 The Coin-Flip Random Sequence 


The simplest example of a random sequence is the coin-flip sequence, 
which we denote by $c = \{c_n\}_{n=-\infty}^{\infty}$. We imagine that, at each “time” $n$,
a coin is flipped, and $c_n = 1$ if the coin shows heads, and $c_n = -1$ if the
coin shows tails. When we speak of this coin-flip sequence, we refer to this 
random model, not to any specific sequence of ones and minus ones; the 
random coin-flip sequence is not, therefore, a particular sequence, just as a 
random variable is not actually a specific number. Any particular sequence 
of ones and minus ones can be thought of as having resulted from such an 
infinite number of flips of the coin, and is called a realization of the random 
coin-flip sequence. 

It will be convenient to allow for the coin to be biased, that is, for 
the probabilities of heads and tails to be unequal. We denote by p the 
probability that heads occurs and 1 — p the probability of tails; the coin is 
called unbiased or fair if $p = 1/2$. To find the expected value of $c_n$, written
$E(c_n)$, we multiply each possible value of $c_n$ by its probability and sum;
that is,

$$E(c_n) = (+1)p + (-1)(1 - p) = 2p - 1.$$

If the coin is fair then $E(c_n) = 0$. The variance of the random variable $c_n$,
measuring its tendency to deviate from its expected value, is
$\mathrm{var}(c_n) = E([c_n - E(c_n)]^2)$. We have

$$\mathrm{var}(c_n) = [+1 - (2p - 1)]^2 p + [-1 - (2p - 1)]^2 (1 - p) = 4p - 4p^2.$$

If the coin is fair then $\mathrm{var}(c_n) = 1$. It is important to note that we do not
change the coin at any time during the generation of a realization of the
random sequence $c$; in particular, the $p$ does not depend on $n$.

Also, we assume that the random variables $c_n$ are statistically independent.
This means that, for any $N$, any choice of “times” $n_1, \ldots, n_N$, and
any values $m_1, \ldots, m_N$ in the set $\{-1, 1\}$, the probability that $c_{n_1} = m_1$,
$c_{n_2} = m_2$, ..., $c_{n_N} = m_N$ is the product of the individual probabilities. For
example, the probability that $c_1 = -1$, $c_2 = +1$ and $c_4 = +1$ is $(1 - p)p^2$.
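
The distinction between the random model and one realization of it can be illustrated with a short simulation. This is only a sketch: the helper names, the seed, and the choice $p = 0.7$ are assumptions, and the sample statistics merely approximate the model values $2p - 1$ and $4p - 4p^2$.

```python
import random

def coin_flip_realization(p, length, seed=0):
    """One finite realization of the coin-flip sequence:
    each entry is +1 with probability p and -1 otherwise, independently."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else -1 for _ in range(length)]

def sample_mean_and_variance(values):
    """Sample mean and (population-style) sample variance of a list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

p = 0.7                                   # hypothetical bias of the coin
flips = coin_flip_realization(p, 100_000)
mean, var = sample_mean_and_variance(flips)
print("sample mean    ", mean, " model value", 2 * p - 1)
print("sample variance", var, " model value", 4 * p - 4 * p * p)
```

Changing the seed produces a different realization, but the model, and therefore the values $2p - 1$ and $4p - 4p^2$, stays the same.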


12.4 Correlation 


Let u and v be (possibly complex-valued) random variables with ex- 
pected values E(u) and E(v), respectively. The covariance between u and 
v is defined to be 


$$\mathrm{cov}(u, v) = E\big((u - E(u))\overline{(v - E(v))}\big),$$

and the cross-correlation between $u$ and $v$ is

$$\mathrm{corr}(u, v) = E(u\overline{v}).$$


It is easily shown that $\mathrm{cov}(u, v) = \mathrm{corr}(u, v) - E(u)\overline{E(v)}$. When $u = v$
we get $\mathrm{cov}(u, u) = \mathrm{var}(u)$ and $\mathrm{corr}(u, u) = E(|u|^2)$. If $E(u) = E(v) = 0$
then $\mathrm{cov}(u, v) = \mathrm{corr}(u, v)$. In statistics the “correlation coefficient” is the
quantity $\mathrm{cov}(u, v)$ divided by the standard deviations of $u$ and $v$.

When u and v are independent, we have 


$$E(u\overline{v}) = E(u)E(\overline{v}),$$

and

$$E\big((u - E(u))\overline{(v - E(v))}\big) = E\big(u - E(u)\big)E\big(\overline{v - E(v)}\big) = 0.$$


To illustrate, let $u = c_n$ and $v = c_{n-m}$. Then, if the coin is fair,
$E(c_n) = E(c_{n-m}) = 0$ and

$$\mathrm{cov}(c_n, c_{n-m}) = \mathrm{corr}(c_n, c_{n-m}) = E(c_n\overline{c_{n-m}}).$$

Because the $c_n$ are independent, $E(c_n\overline{c_{n-m}}) = 0$ for $m$ not equal to 0, and
$E(|c_n|^2) = \mathrm{var}(c_n) = 1$. Therefore

$$\mathrm{cov}(c_n, c_{n-m}) = \mathrm{corr}(c_n, c_{n-m}) = 0, \quad \text{for } m \neq 0,$$

and

$$\mathrm{cov}(c_n, c_n) = \mathrm{corr}(c_n, c_n) = 1.$$

In the next section we shall use the random coin-flip sequence to generate
a wide class of random sequences, obtained by viewing $c = \{c_n\}$ as
the input into a shift-invariant discrete linear filter.


12.5 Filtering Random Sequences 


Suppose, once again, that T is a shift-invariant discrete linear filter 
with impulse-response sequence g. Now let us take as input, not a particular 
sequence, but the random coin-flip sequence c, with p = 0.5. The output will 
therefore not be a particular sequence either, but will be another random 
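sequence, say $d$. Then, for each $n$ the random variable $d_n$ is

$$d_n = \sum_{m=-\infty}^{\infty} c_m g_{n-m} = \sum_{m=-\infty}^{\infty} g_m c_{n-m}. \qquad (12.1)$$

We compute the correlation $\mathrm{corr}(d_n, d_{n-m}) = E(d_n\overline{d_{n-m}})$. Using the
convolution formula Equation (12.1), we find that

$$\mathrm{corr}(d_n, d_{n-m}) = \sum_{k=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} g_k\overline{g_j}\,\mathrm{corr}(c_{n-k}, c_{n-m-j}).$$

Since

$$\mathrm{corr}(c_{n-k}, c_{n-m-j}) = 0, \quad \text{for } k \neq m + j,$$

and this correlation equals one when $k = m + j$, we have

$$\mathrm{corr}(d_n, d_{n-m}) = \sum_{k=-\infty}^{\infty} g_k\overline{g_{k-m}}. \qquad (12.2)$$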


The expression on the right side of Equation (12.2) is the definition of the
autocorrelation of the non-random sequence $g$, denoted $\rho_g = \{\rho_g(m)\}$; that
is,

$$\rho_g(m) = \sum_{k=-\infty}^{\infty} g_k\overline{g_{k-m}}. \qquad (12.3)$$

It is important to note that the expected value of $d_n$ is

$$E(d_n) = \sum_{k=-\infty}^{\infty} g_k E(c_{n-k}) = 0$$

and the correlation $\mathrm{corr}(d_n, d_{n-m})$ depends only on $m$; neither quantity
depends on $n$ and the sequence $d$ is therefore called wide-sense stationary.
Let’s consider an example.


12.6 An Example 


Take $g_0 = g_1 = 0.5$ and $g_k = 0$ otherwise. Then the system is the
two-point moving-average, with

$$d_n = 0.5c_n + 0.5c_{n-1}.$$

In the case of the random-coin-flip sequence $c$ each $c_n$ is unrelated to all
other $c_m$; the coin flips are independent. This is no longer the case for the
$d_n$; one effect of the filter $g$ is to introduce correlation into the output. To
illustrate, since $d_0$ and $d_1$ both depend, to some degree, on the value $c_0$,
they are related. Using Equation (12.3) we have

$$\mathrm{corr}(d_n, d_n) = \rho_g(0) = g_0g_0 + g_1g_1 = 0.25 + 0.25 = 0.5,$$

$$\mathrm{corr}(d_n, d_{n+1}) = \rho_g(-1) = g_0g_1 = 0.25,$$

$$\mathrm{corr}(d_n, d_{n-1}) = \rho_g(+1) = g_1g_0 = 0.25,$$

and

$$\mathrm{corr}(d_n, d_{n-m}) = \rho_g(m) = 0, \quad \text{otherwise}.$$

So we see that $d_n$ and $d_{n-m}$ are related, for $m = -1, 0, +1$, but not
otherwise.
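
The numbers in this example can be checked with a short sketch that computes $\rho_g(m)$ directly and compares it with averages taken from one simulated realization of $d$. The function names, the seed, and the realization length are assumptions made for illustration; the code handles only real, causal, finite filters stored as Python lists.

```python
import random

def autocorrelation(g, m):
    """rho_g(m) = sum_k g[k] * g[k - m] for a real, causal, finite sequence g."""
    return sum(g[k] * g[k - m] for k in range(len(g)) if 0 <= k - m < len(g))

def filter_coin_flips(g, length, seed=0):
    """One realization of d_n = sum_k g[k] * c[n - k] with fair coin flips c."""
    rng = random.Random(seed)
    pad = len(g) - 1                      # extra flips needed before n = 0
    c = [rng.choice((-1, 1)) for _ in range(length + pad)]
    return [sum(g[k] * c[n + pad - k] for k in range(len(g)))
            for n in range(length)]

def sample_correlation(d, m):
    """Average of d[n] * d[n - m] over the indices available in the data."""
    m = abs(m)                            # real data: symmetric in m
    return sum(d[n] * d[n - m] for n in range(m, len(d))) / (len(d) - m)

g = [0.5, 0.5]                            # the two-point moving average
d = filter_coin_flips(g, 50_000)
for m in (-1, 0, 1, 2):
    print(m, autocorrelation(g, m), round(sample_correlation(d, m), 3))
```

The first printed column after $m$ is the exact value from Equation (12.3); the second is only an estimate and will fluctuate with the seed.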


12.7 Correlation Functions and Power Spectra 


As we have seen, any non-random sequence $g = \{g_n\}$ has its
autocorrelation function defined, for each integer $m$, by

$$\rho_g(m) = \sum_{k=-\infty}^{\infty} g_k\overline{g_{k-m}}.$$

For a random sequence $d$ that is wide-sense stationary, its correlation
function is defined to be

$$\rho_d(m) = E(d_n\overline{d_{n-m}}).$$

The power spectrum of $g$ is defined for $\omega$ in $[-\pi, \pi]$ by

$$R_g(\omega) = \sum_{m=-\infty}^{\infty} \rho_g(m)e^{im\omega}.$$


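It is easy to see that

$$R_g(\omega) = |G(\omega)|^2,$$

where

$$G(\omega) = \sum_{n=-\infty}^{\infty} g_n e^{in\omega},$$

so that $R_g(\omega) \geq 0$. The power spectrum of the random sequence $d = \{d_n\}$
is defined as

$$R_d(\omega) = \sum_{m=-\infty}^{\infty} \rho_d(m)e^{im\omega}.$$

Although it is not immediately obvious, we also have $R_d(\omega) \geq 0$. One way
to see this is to consider

$$D(\omega) = \sum_{n=-\infty}^{\infty} d_n e^{in\omega}$$

and to calculate

$$E(|D(\omega)|^2) = \sum_{m=-\infty}^{\infty} E(d_n\overline{d_{n-m}})e^{im\omega} = R_d(\omega).$$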


Given any power spectrum $R_d(\omega) \geq 0$ we can construct $G(\omega)$ by selecting
an arbitrary phase angle $\theta$ and letting

$$G(\omega) = \sqrt{R_d(\omega)}\,e^{i\theta}.$$

We then obtain the non-random sequence $g$ associated with $G(\omega)$ using

$$g_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} G(\omega)e^{-in\omega}\,d\omega.$$

It follows that $\rho_g(m) = \rho_d(m)$ for each $m$ and $R_g(\omega) = R_d(\omega)$ for each $\omega$.
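
The relation between the power spectrum of $g$ and the squared magnitude of $G(\omega)$ can be checked numerically for a finitely supported $g$. The sketch below assumes the conventions $\rho_g(m) = \sum_k g_k\overline{g_{k-m}}$ and $G(\omega) = \sum_n g_n e^{in\omega}$; the function names and the test frequencies are illustrative choices.

```python
import cmath
import math

def autocorrelation(g, m):
    """rho_g(m) = sum_k g[k] * conj(g[k - m]) for a causal, finite sequence g."""
    return sum(g[k] * complex(g[k - m]).conjugate()
               for k in range(len(g)) if 0 <= k - m < len(g))

def power_spectrum(g, w):
    """R_g(w) = sum_m rho_g(m) * exp(i m w); only |m| < len(g) contributes."""
    span = len(g) - 1
    return sum(autocorrelation(g, m) * cmath.exp(1j * m * w)
               for m in range(-span, span + 1))

def transfer_function(g, w):
    """G(w) = sum_n g[n] * exp(i n w)."""
    return sum(gn * cmath.exp(1j * n * w) for n, gn in enumerate(g))

g = [0.5, 0.5]                            # the two-point moving average
for w in (0.0, math.pi / 2, math.pi):
    print(w, power_spectrum(g, w).real, abs(transfer_function(g, w)) ** 2)
```

For each frequency the two printed values should agree up to rounding, and both are nonnegative.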

What we have discovered is that, when the input to the system is the
random-coin-flip sequence $c$, the output sequence $d$ has a correlation
function $\rho_d(m)$ that is equal to the autocorrelation of the sequence $g$. As we just
saw, for any wide-sense stationary random sequence $d$ with expected value
$E(d_n)$ constant and correlation function $\mathrm{corr}(d_n, d_{n-m})$ independent of $n$,
there is a shift-invariant discrete linear system $T$ with impulse-response
sequence $g$, such that $\rho_g(m) = \rho_d(m)$ for each $m$. Therefore, any
wide-sense stationary random sequence $d$ can be viewed as the output of a
shift-invariant discrete linear system, when the input is the random-coin-flip
sequence $c = \{c_n\}$.


12.8 The Dirac Delta in Frequency Space


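Consider the “function” defined by the infinite sum

$$\delta(\omega) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} e^{in\omega} = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} e^{-in\omega}. \qquad (12.4)$$

This is a Fourier series in which all the Fourier coefficients are one. The
series doesn’t converge in the usual sense, but still has some uses. In
particular, look what happens when we take

$$F(\omega) = \sum_{n=-\infty}^{\infty} f_n e^{in\omega},$$

for $-\pi \leq \omega \leq \pi$, and calculate

$$\int_{-\pi}^{\pi} F(\omega)\delta(\omega)\,d\omega = \sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\int_{-\pi}^{\pi} F(\omega)e^{-in\omega}\,d\omega.$$

We have

$$\int_{-\pi}^{\pi} F(\omega)\delta(\omega)\,d\omega = \sum_{n=-\infty}^{\infty} f_n = F(0),$$

where the $f_n$ are the Fourier coefficients of $F(\omega)$. This means that $\delta(\omega)$
has the sifting property, just like we saw with the Dirac delta $\delta(x)$; that is
why we call it $\delta(\omega)$. When we shift $\delta(\omega)$ to get $\delta(\omega - a)$, we find that

$$\int_{-\pi}^{\pi} F(\omega)\delta(\omega - a)\,d\omega = F(a).$$

The “function” $\delta(\omega)$ is the Dirac delta for $\omega$ space.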
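
Although the series in Equation (12.4) does not converge, its partial sums already sift exactly when $F(\omega)$ has only finitely many nonzero Fourier coefficients. The following sketch illustrates this numerically. The coefficient values, the shift `a`, the truncation level, and the helper names are all assumptions; the Riemann sum is exact (up to rounding) here only because every integrand is a low-degree trigonometric polynomial.

```python
import cmath
import math

def F(w, coeffs):
    """F(w) = sum_n f_n * exp(i n w) for finitely many nonzero coefficients."""
    return sum(fn * cmath.exp(1j * n * w) for n, fn in coeffs.items())

def delta_partial(w, N):
    """Partial sum of Equation (12.4): (1 / 2 pi) * sum_{|n| <= N} exp(-i n w)."""
    return sum(cmath.exp(-1j * n * w) for n in range(-N, N + 1)) / (2 * math.pi)

def integrate(func, points=256):
    """Riemann sum over [-pi, pi) on a uniform grid."""
    step = 2 * math.pi / points
    return sum(func(-math.pi + j * step) for j in range(points)) * step

coeffs = {-1: 1.0, 0: 1.0, 1: 1.0, 2: 0.5}    # hypothetical coefficients f_n
a = 0.3                                        # hypothetical shift
sifted = integrate(lambda w: F(w, coeffs) * delta_partial(w, 5))
shifted = integrate(lambda w: F(w, coeffs) * delta_partial(w - a, 5))
print(abs(sifted - F(0.0, coeffs)), abs(shifted - F(a, coeffs)))
```

Both printed differences should be at the level of floating-point rounding, reflecting $\int F\delta = F(0)$ and $\int F(\omega)\delta(\omega - a)\,d\omega = F(a)$.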


12.9 Random Sinusoidal Sequences 


Consider $A = |A|e^{i\theta}$, with amplitude $|A|$ a positive-valued random
variable and phase angle $\theta$ a random variable taking values in the interval
$[-\pi, \pi]$; then $A$ is a complex-valued random variable. For a fixed frequency
$\omega_0$ we define a random sinusoidal sequence $s = \{s_n\}$ by $s_n = Ae^{-in\omega_0}$.
We assume that $\theta$ has the uniform distribution over $[-\pi, \pi]$ so that the
expected value of $s_n$ is zero. The correlation function for $s$ is

$$\rho_s(m) = E(s_n\overline{s_{n-m}}) = E(|A|^2)e^{-i\omega_0 m}$$

and the power spectrum of $s$ is

$$R_s(\omega) = E(|A|^2)\sum_{m=-\infty}^{\infty} e^{im(\omega - \omega_0)},$$

so that, by Equation (12.4), we have

$$R_s(\omega) = 2\pi E(|A|^2)\delta(\omega - \omega_0).$$

We generalize this example to the case of multiple independent sinusoids.
Suppose that, for $j = 1, \ldots, J$, we have fixed frequencies $\omega_j$ and
independent complex-valued random variables $A_j$. We let our random sequence be
defined by

$$s_n = \sum_{j=1}^{J} A_j e^{-in\omega_j}.$$

Then the correlation function for $s$ is

$$\rho_s(m) = \sum_{j=1}^{J} E(|A_j|^2)e^{-i\omega_j m}$$

and the power spectrum for $s$ is

$$R_s(\omega) = 2\pi\sum_{j=1}^{J} E(|A_j|^2)\delta(\omega - \omega_j).$$

This is the commonly used model of independent sinusoids. The problem of
power spectrum estimation is to determine the values $J$, the frequencies $\omega_j$
and the variances $E(|A_j|^2)$ from finitely many samples from one or more
realizations of the random sequence $s$.


12.10 Random Noise Sequences 


Let $q = \{q_n\}$ be an arbitrary wide-sense stationary discrete random
sequence, with correlation function $\rho_q(m)$ and power spectrum $R_q(\omega)$. We say
that $q$ is white noise if $\rho_q(m) = 0$ for $m$ not equal to zero, or, equivalently,
if the power spectrum $R_q(\omega)$ is constant over the interval $[-\pi, \pi]$. The
independent sinusoids in additive white noise model is a random sequence of
the form

$$x_n = \sum_{j=1}^{J} A_j e^{-in\omega_j} + q_n.$$

The signal power is defined to be $\rho_s(0)$, which is the sum of the $E(|A_j|^2)$,
while the noise power is $\rho_q(0)$. The signal-to-noise ratio (SNR) is the ratio
of signal power to noise power.


12.11 Increasing the SNR 


It is often the case that the SNR is quite low and it is desirable to process
the data from $x$ to enhance this ratio. The data we have is typically finitely
many values of one realization of $x$. We say we have the values $x_n$ of that
realization for $n = 1, 2, \ldots, N$; we don’t say we have the entries of the
random sequence $x$ themselves, because each such entry is a random variable,
not one value of the random variable. One way to process the data is to
estimate $\rho_x(m)$ for some small number of integers $m$ around zero, using, for
example, the lag products estimate

$$\hat{\rho}_x(m) = \frac{1}{N - m}\sum_{n=1}^{N-m} x_{n+m}\overline{x_n},$$

for $m = 0, 1, \ldots, M < N$ and $\hat{\rho}_x(-m) = \overline{\hat{\rho}_x(m)}$. Because $\rho_q(m) = 0$ for $m$
not equal to zero, we will have $\hat{\rho}_x(m)$ approximating $\rho_s(m)$ for nonzero
values of $m$, thereby reducing the effect of the noise. Therefore, our estimates
of $\rho_s(m)$ are relatively noise-free for $m \neq 0$.
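
A rough simulation can illustrate why the nonzero lags are less affected by white noise. The sketch below assumes a single sinusoid of the form $s_n = Ae^{-in\omega_0}$ with $|A| = 1$, so that $\rho_s(m) = e^{-i\omega_0 m}$, plus complex Gaussian white noise of power one; the frequency, data length, seed, and function name are illustrative assumptions.

```python
import cmath
import math
import random

def lag_products(x, m):
    """Lag-products estimate for m >= 0: the average of x[n + m] * conj(x[n])
    over the N - m products available in the data."""
    N = len(x)
    return sum(x[n + m] * x[n].conjugate() for n in range(N - m)) / (N - m)

rng = random.Random(1)
w0, N, sigma = 0.8, 20_000, 1.0            # hypothetical frequency, length, noise level
theta = rng.uniform(-math.pi, math.pi)     # one random phase for this realization
A = cmath.exp(1j * theta)                  # |A| = 1, so rho_s(m) = exp(-i w0 m)
scale = sigma / math.sqrt(2)               # real and imaginary parts share the power
x = [A * cmath.exp(-1j * n * w0)
     + complex(rng.gauss(0, scale), rng.gauss(0, scale))
     for n in range(N)]

for m in (0, 1, 2):
    print(m, lag_products(x, m), cmath.exp(-1j * w0 * m))
```

At $m = 0$ the estimate is close to the signal power plus the noise power, while at $m = 1$ and $m = 2$ it is close to $\rho_s(m)$ alone.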


12.12 Colored Noise 


The additive noise is said to be correlated or non-white if it is not
the case that $\rho_q(m) = 0$ for all nonzero $m$. In this case the noise power
spectrum is not constant, and so may be concentrated in certain regions of
the interval $[-\pi, \pi]$.

The next few sections deal with applications of random sequences. 


12.13 Spread-Spectrum Communication 


In this section we return to the random-coin-flip model, this time
allowing the coin to be biased, that is, $p$ need not be 0.5. Let $s = \{s_n\}$ be
a random sequence, such as $s_n = Ae^{-in\omega_0}$, with $E(s_n) = \mu$ and correlation
function $\rho_s(m)$. Define a second random sequence $x$ by

$$x_n = s_n c_n.$$

The random sequence $x$ is generated from the random signal $s$ by randomly
changing its signs. We can show that

$$E(x_n) = \mu(2p - 1)$$

and, for $m$ not equal to zero,

$$\rho_x(m) = \rho_s(m)(2p - 1)^2,$$

with

$$\rho_x(0) = \rho_s(0) + 4p(1 - p)\mu^2.$$

Therefore, if $p = 1$ or $p = 0$ we get $\rho_x(m) = \rho_s(m)$ for all $m$, but for
$p = 0.5$ we get $\rho_x(m) = 0$ for $m$ not equal to zero. If the coin is unbiased,
then the random sign changes convert the original signal $s$ into white noise.
Generally, we have

$$R_x(\omega) = (2p - 1)^2 R_s(\omega) + (1 - (2p - 1)^2)(\mu^2 + \rho_s(0)),$$


which says that the power spectrum of x is a combination of the signal 
power spectrum and a white-noise power spectrum, approaching the white- 
noise power spectrum as p approaches 0.5. If the original signal power spec- 
trum is concentrated within a small interval, then the effect of the random 
sign changes is to spread that spectrum. Once we know what the particular 
realization of the random sequence $c$ is that has been used, we can
recapture the original signal from $s_n = x_n c_n$. The use of such a spread spectrum
permits the sending of multiple narrow-band signals, without confusion, as 
well as protecting against any narrow-band additive interference. 
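
The spreading and recovery steps can be sketched in a few lines. The sinusoidal signal, its frequency, the sequence length, and the helper name `coin_flips` are assumptions made for illustration; the point is only that multiplying by the same realization of $c$ a second time undoes the sign changes, since $c_n c_n = 1$.

```python
import cmath
import random

def coin_flips(p, length, seed=0):
    """One realization of the coin-flip sequence with heads probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else -1 for _ in range(length)]

w0 = 0.5                                            # hypothetical signal frequency
s = [cmath.exp(-1j * n * w0) for n in range(16)]    # one realization of s, with A = 1
c = coin_flips(0.5, len(s))                         # realization of c known to the receiver

x = [sn * cn for sn, cn in zip(s, c)]               # transmitted, sign-flipped sequence
recovered = [xn * cn for xn, cn in zip(x, c)]       # multiply by c again to undo the flips

print(max(abs(r - sn) for r, sn in zip(recovered, s)))
```

A receiver using a different realization of $c$ would not recover $s$, which is the property that lets several signals share the same band.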


12.14 Stochastic Difference Equations 


The ordinary first-order differential equation $y'(t) + ay(t) = f(t)$, with
initial condition $y(0) = 0$, has for its solution $y(t) = e^{-at}\int_0^t e^{as}f(s)\,ds$.
One way to look at such differential equations is to consider $f(t)$ to be
the input to a system having $y(t)$ as its output. The system determines
which terms will occur on the left side of the differential equation. In many
applications the input $f(t)$ is viewed as random noise and the output is then
a continuous-time random process. Here we want to consider the discrete
analog of such differential equations.

We replace the first derivative with the first difference, $y_{n+1} - y_n$, and we
replace the input with the random-coin-flip sequence $c = \{c_n\}$, to obtain
the random difference equation

$$y_{n+1} - y_n + ay_n = c_n.$$

With $b = 1 - a$ and $0 < b < 1$ we have

$$y_{n+1} - by_n = c_n. \qquad (12.5)$$

The solution is $y = \{y_n\}$ given by

$$y_n = b^{n-1}\sum_{k=-\infty}^{n-1} b^{-k}c_k. \qquad (12.6)$$

Comparing this with the solution of the differential equation, we see that
the term $b^{n-1}$ plays the role of $e^{-at} = (e^{-a})^t$, so that $b = 1 - a$ is
substituting for $e^{-a}$. The infinite sum replaces the integral, with $b^{-k}c_k$
replacing the integrand $e^{as}f(s)$.

The solution sequence $y$ given by Equation (12.6) is a wide-sense
stationary random sequence and its correlation function is

$$\rho_y(m) = b^{|m|}/(1 - b^2).$$

Since

$$b^{n-1}\sum_{k=-\infty}^{n-1} b^{-k} = \frac{1}{1 - b},$$

the random sequence $(1 - b)y_n = ay_n$ is an infinite moving-average random
sequence formed from the random sequence $c$.

We can derive the solution in Equation (12.6) using z-transforms. We
write

$$Y(z) = \sum_{n=-\infty}^{\infty} y_n z^{-n},$$

and

$$C(z) = \sum_{n=-\infty}^{\infty} c_n z^{-n}.$$

From Equation (12.5) we have

$$zY(z) - bY(z) = C(z),$$

or

$$Y(z) = C(z)(z - b)^{-1}.$$

Expanding in a geometric series, we get

$$Y(z) = C(z)z^{-1}\left(1 + bz^{-1} + b^2z^{-2} + \cdots\right),$$

from which the solution given in Equation (12.6) follows immediately.
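
The agreement between the recursion in Equation (12.5) and the closed form in Equation (12.6) can be checked on a finite stretch of data. The sketch below makes a simplifying assumption that is not in the text: the input is taken to be zero before time $n = 0$, so the sum in Equation (12.6) starts at $k = 0$ and $y_0 = 0$. The value of `b`, the seed, and the function names are also illustrative.

```python
import random

def solve_by_recursion(c, b):
    """Iterate y_{n+1} = b * y_n + c_n starting from y_0 = 0."""
    y = [0.0]
    for cn in c:
        y.append(b * y[-1] + cn)
    return y

def solve_by_sum(c, b, n):
    """Equation (12.6) truncated to k >= 0, written with nonnegative powers:
    y_n = sum_{k=0}^{n-1} b^(n-1-k) * c_k."""
    return sum(b ** (n - 1 - k) * c[k] for k in range(n))

rng = random.Random(0)
b = 0.8                                    # hypothetical value with 0 < b < 1
c = [rng.choice((-1, 1)) for _ in range(200)]
y = solve_by_recursion(c, b)
print(max(abs(y[n] - solve_by_sum(c, b, n)) for n in range(len(y))))
```

Writing the weights as $b^{n-1-k}$ rather than $b^{n-1}b^{-k}$ avoids forming large powers of $1/b$ in floating point.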


12.15 Random Vectors and Correlation Matrices 


In estimation and detection theory, the task is to distinguish signal 
vectors from noise vectors. In order to perform such a task, we need to know 
how signal vectors differ from noise vectors. Most frequently, what we have 
is statistical information. The signal vectors of interest, which we denote by 
$s = (s_1, \ldots, s_N)^T$, typically exhibit some patterns of behavior among their
entries. For example, a constant signal, such as $s = (1, 1, \ldots, 1)^T$, has all its
entries identical. A sinusoidal signal, such as $s = (1, -1, 1, -1, \ldots, 1, -1)^T$,
exhibits a periodicity in its entries. If the signal is a vectorization of a
two-dimensional image, then the patterns will be more difficult to describe,
but will be there, nevertheless. In contrast, a typical noise vector, denoted
$q = (q_1, \ldots, q_N)^T$, may have entries that are statistically unrelated to each
other, as in white noise. Of course, what is signal and what is noise depends 
on the context; unwanted interference in radio may be viewed as noise, even 
though it may be a weather report or a song. 

To deal with these notions mathematically, we adopt statistical models. 
The entries of s and q are taken to be random variables, so that s and q are 
random vectors. Often we assume that the mean values, E(s) and E(q), 
are both equal to the zero vector. Then patterns that may exist among 
the entries of these vectors are described in terms of correlations. The 
noise covariance matrix, which we denote by $Q$, has for its entries

$$Q_{mn} = E\big((q_m - E(q_m))\overline{(q_n - E(q_n))}\big),$$

for $m, n = 1, \ldots, N$. The signal covariance matrix is defined similarly.
If $E(q_n) = 0$ and $E(|q_n|^2) = 1$ for each $n$,
then $Q$ is the noise correlation matrix. Such matrices $Q$ are Hermitian and
nonnegative definite, that is, $x^\dagger Qx$ is nonnegative, for every vector $x$. If $Q$
is a positive multiple of the identity matrix, then the noise vector $q$ is said
to be a white noise random vector.


12.16 The Prediction Problem


An important problem in signal processing is the estimation of the next 
term in a sequence of numbers from knowledge of the previous values. 
This is called the prediction problem. The numbers might be the values at 
closing of a certain stock market index; knowing what has happened up 
to today, can we predict, with some accuracy, tomorrow’s closing value? 
The numbers might describe the position in space of a missile; knowing 
where it has been for the past few minutes, can we predict where it will 
be for the next few? The numbers might be the noon-time temperature in 
New York City on successive days; can we predict tomorrow’s temperature 
from our knowledge of the temperatures on previous days? It is helpful, in 
weather prediction and elsewhere, to use not only the previous values of the 
sequence of interest, but those of related sequences; the recent temperatures 
in Pittsburgh might be helpful in predicting tomorrow’s weather in New 
York City. In this chapter we begin a discussion of the prediction problem. 


12.17 Prediction Through Interpolation 


Suppose that our data are the real numbers $x_1, \ldots, x_m$, corresponding
to times $t = 1, \ldots, m$. Our goal is to estimate $x_{m+1}$. One way to do this is
by interpolation.

A function $f(t)$ is said to interpolate the data if $f(n) = x_n$ for $n =
1, \ldots, m$. Having found such an interpolating function, we can take as our
prediction of $x_{m+1}$ the number $\hat{x}_{m+1} = f(m + 1)$. Of course, there are
infinitely many choices for the interpolating function f(t). In our discussion 
of Fourier transform estimation, we considered methods of interpolation 
that incorporated prior knowledge about the function being sampled, such 
as that it was band-limited. In the absence of such additional information 
polynomial interpolation is one obvious choice. 

Polynomial interpolation involves selecting as the function f(t) the poly- 
nomial of least degree that interpolates the data. Given m data points, we 
seek a polynomial of degree m — 1. Lagrange’s method is a well-known 
procedure for solving this problem. 

For $k = 1, \ldots, m$, let $L_k(t)$ be the unique polynomial of degree $m - 1$
with the properties $L_k(k) = 1$ and $L_k(n) = 0$ for $n = 1, \ldots, m$ and $n \neq k$.
We can write each $L_k(t)$ explicitly, since we know its zeros:

$$L_k(t) = \frac{(t - 1)\cdots(t - (k - 1))(t - (k + 1))\cdots(t - m)}{(k - 1)\cdots(k - (k - 1))(k - (k + 1))\cdots(k - m)}.$$

Then the polynomial

$$P_m(t) = \sum_{k=1}^{m} x_k L_k(t)$$

is the interpolating polynomial we seek.
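
Lagrange's construction translates directly into code. The following sketch evaluates $P_m(m + 1)$ with exact rational arithmetic; the function names and the sample data, taken from $t^2$ at $t = 1, 2, 3$, are assumptions made for illustration.

```python
from fractions import Fraction

def lagrange_basis(k, t, m):
    """L_k(t): equals 1 at t = k and 0 at the other integers 1..m."""
    value = Fraction(1)
    for n in range(1, m + 1):
        if n != k:
            value *= Fraction(t - n, k - n)
    return value

def predict_next(x):
    """Evaluate the interpolating polynomial P_m at t = m + 1,
    where the list x holds x_1, ..., x_m."""
    m = len(x)
    return sum(x[k - 1] * lagrange_basis(k, m + 1, m) for k in range(1, m + 1))

print(predict_next([1, 4, 9]))   # data from t^2 at t = 1, 2, 3; prints 16
```

Because the sample data lie on a polynomial of degree two and $m = 3$, the prediction reproduces the next value of that polynomial exactly.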

Ex. 12.1 Show that for $m = 1$ the predicted value of $x_2$ is $\hat{x}_2 = x_1$, so that

$$\hat{x}_2 - x_1 = 0.$$

This is the “Tomorrow will be like today” prediction.


Ex. 12.2 Show that for $m = 2$ the predicted value of $x_3$ is $\hat{x}_3 = 2x_2 - x_1$,
or $\hat{x}_3 - x_2 = (x_2 - x_1)$, so that

$$\hat{x}_3 - 2x_2 + x_1 = 0.$$

This prediction amounts to assuming the change from today to tomorrow
will be the same as the change from yesterday to today; that is, we assume
a constant slope.


Ex. 12.3 Show that for $m = 3$ the predicted value of $x_4$ is
$\hat{x}_4 = 3x_3 - 3x_2 + x_1$, so that

$$\hat{x}_4 - 3x_3 + 3x_2 - x_1 = 0.$$


Ex. 12.4 The coefficients in the previous exercises fit a pattern. Using
this pattern, determine the predicted value of $x_5$ for the case of $m = 4$. In
general, what will be the predicted value of $x_{m+1}$ based on the $m$ previous
values?


The concept of divided difference plays a significant role in interpola- 
tion, as we shall see. 


12.18 Divided Differences 


The zeroth divided difference of a function $f(t)$ with respect to the point
$t_0$ is $f[t_0] = f(t_0)$. The first divided difference with respect to the points $t_0$
and $t_1$ is

$$f[t_0, t_1] = \frac{f(t_1) - f(t_0)}{t_1 - t_0}.$$

The $m$th divided difference with respect to the points $t_0, \ldots, t_m$ is

$$f[t_0, \ldots, t_m] = \frac{f[t_1, \ldots, t_m] - f[t_0, \ldots, t_{m-1}]}{t_m - t_0}.$$

These quantities are discrete analogs of the derivatives of a function. Indeed,
if $f(t)$ is a polynomial of degree at most $m - 1$ then the $m$th divided
difference is zero, for any points $t_0, \ldots, t_m$.

When the points $t_0, \ldots, t_m$ are consecutive integers the divided
differences take on a special form. Suppose $t_0 = 1$, $t_1 = 2$, ..., $t_m = m + 1$. Then,

$$f[t_0, t_1] = f(2) - f(1),$$

$$f[t_0, t_1, t_2] = \frac{1}{2}\big(f(3) - 2f(2) + f(1)\big),$$

$$f[t_0, t_1, t_2, t_3] = \frac{1}{6}\big(f(4) - 3f(3) + 3f(2) - f(1)\big),$$

and so on, with each successive divided difference involving the coefficients
in the expansion of the binomial $(a - b)^m$.

For each fixed value of $m \geq 1$ and $1 \leq n \leq m$, we have $f(n) = x_n$ and
$f(m + 1) = \hat{x}_{m+1}$. According to the previous exercises, for $m = 1$ we can
write

$$\hat{x}_2 - x_1 = 0,$$

which says that the first divided difference is zero; that is, $f[1, 2] = 0$. For
$m = 2$ we have

$$[\hat{x}_3 - x_2] - [x_2 - x_1] = 0,$$

or $f[1, 2, 3] = 0$, so the second divided difference is zero. For $m = 3$

$$[[\hat{x}_4 - x_3] - [x_3 - x_2]] - [[x_3 - x_2] - [x_2 - x_1]] = 0,$$

which says that the third divided difference, $f[1, 2, 3, 4]$, is zero. The
interpolation is achieved by assuming that the $m$ data points as well as the
point to be interpolated lie on a polynomial of degree at most $m - 1$. Under
this assumption the $m$th divided difference with respect to the points
$1, 2, \ldots, m + 1$ would be zero. The interpolated value can then be calculated
by setting the $m$th divided difference equal to zero, but replacing $x_{m+1}$
with the estimate $\hat{x}_{m+1}$.

The coefficients that occur in these various predictors are those in the
expansion of the binomial $(a - b)^m$. To investigate this matter further, we
define the first difference operator on an arbitrary sequence $x = \{x_n\}$ to
be the operator $D$ such that $y = Dx$, where $y = \{y_n\}$ is the sequence
with entries $y_n = x_n - x_{n-1}$. Notice that the operator $D$ can be written
as $D = I - S$, where $I$ is the identity operator and $S$ is the shift operator;
that is, $Sx = z$ where $z = \{z_n\}$ is the sequence with entries $z_n = x_{n-1}$.
The $k$th difference operator is $D^k = (I - S)^k$; expanding this product in
terms of powers of $S$ leads to the binomial coefficients that we saw earlier.
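
Both ideas in this section, the recursive definition of the divided difference and the prediction obtained by setting the $m$th difference to zero, can be sketched briefly. The function names and the cubic test polynomial are assumptions made for illustration; exact rational arithmetic is used so that a vanishing difference prints as exactly zero.

```python
from fractions import Fraction
from math import comb

def divided_difference(f, points):
    """f[t_0, ..., t_m] computed from the recursive definition."""
    if len(points) == 1:
        return Fraction(f(points[0]))
    upper = divided_difference(f, points[1:])
    lower = divided_difference(f, points[:-1])
    return (upper - lower) / (points[-1] - points[0])

def predict_by_differences(x):
    """Choose the next value so that the mth difference of
    x_1, ..., x_m, next is zero; the list x holds x_1, ..., x_m."""
    m = len(x)
    # (I - S)^m applied at index m + 1 is sum_j (-1)^j C(m, j) x_{m+1-j};
    # setting it to zero and solving for the j = 0 term gives the prediction.
    return sum((-1) ** (j + 1) * comb(m, j) * x[m - j] for j in range(1, m + 1))

cubic = lambda t: t ** 3 - 2 * t + 1       # hypothetical test polynomial
print(divided_difference(cubic, [1, 2, 3, 4, 5]))   # degree 3 < 4, so this vanishes
print(predict_by_differences([cubic(t) for t in (1, 2, 3, 4)]), cubic(5))
```

With four samples of a cubic, the prediction agrees with the true fifth value, in line with the statement that the method is exact for polynomial data of degree at most $m - 1$.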

This method of predicting using the interpolating polynomial of degree 
$m - 1$ will be perfectly accurate if the sequence $\{x_n\}$ is formed by taking
values from a polynomial of degree m—1 or less. Typically, our data contains 
noise and interpolating the data exactly, while theoretically possible, is not 
wise or useful. 

The prediction method used here is linear in the sense that our predicted 
value is a linear combination of the data values and the coefficients we use 
do not involve the data. Another approach, linear predictive coding, is 
somewhat different. 


12.19 Linear Predictive Coding 


Suppose once again that we have the data $x_1, \ldots, x_m$ and we want
to predict $x_{m+1}$. Instead of using a linear combination of all the values
$x_1, \ldots, x_m$, we choose to use as our prediction of $x_{m+1}$ a linear combination
of $x_{m-p}, x_{m-p+1}, \ldots, x_m$, where $p$ is a positive integer much smaller than
$m$. So, our prediction has the form

$$\hat{x}_{m+1} = a_0x_m + a_1x_{m-1} + \cdots + a_px_{m-p}.$$

To find the best coefficients $a_0, \ldots, a_p$ to use, we imagine trying out each
possible choice of coefficients, using them to predict data values we already
know. Specifically, for each set of coefficients $\{a_0, \ldots, a_p\}$, we form the
predictions

$$\hat{x}_{p+2} = a_0x_{p+1} + a_1x_p + a_2x_{p-1} + \cdots + a_px_1,$$

$$\hat{x}_{p+3} = a_0x_{p+2} + a_1x_{p+1} + a_2x_p + \cdots + a_px_2,$$

and so on, down to

$$\hat{x}_m = a_0x_{m-1} + a_1x_{m-2} + \cdots + a_px_{m-(p+1)}.$$


Since we already know what the true values are, we can compare the
predicted values with the true ones and then find the choice of coefficients
that minimizes the average squared error. This amounts to finding the
least-squares solution of the system of equations obtained by replacing the
predictions with the true values on the left side of the previous equations:

$$\begin{bmatrix} x_{p+1} & x_p & \cdots & x_1 \\ x_{p+2} & x_{p+1} & \cdots & x_2 \\ \vdots & \vdots & & \vdots \\ x_{m-1} & x_{m-2} & \cdots & x_{m-p-1} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} x_{p+2} \\ x_{p+3} \\ \vdots \\ x_m \end{bmatrix},$$

which we write as $Ga = b$. Since $m$ is typically larger than $p$, this system
is overdetermined. The least-squares solution is

$$a = (G^\dagger G)^{-1}G^\dagger b.$$

The resulting set of coefficients is then used to make a linear combination
of the values $x_m, \ldots, x_{m-p}$, which is then our predicted value. But note
that although a linear combination of data forms the predicted value, the
coefficients are determined from the data values themselves, so the overall
method is nonlinear.
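
The least-squares step can be sketched without any external libraries by forming the normal equations $G^\dagger Ga = G^\dagger b$ for real data and solving them by Gaussian elimination. Everything named here (`solve_linear_system`, `lpc_coefficients`, `predict_next`) and the test recursion $x_{n+1} = 1.5x_n - 0.7x_{n-1}$ are assumptions made for illustration, not a reference implementation of LPC.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting
    (A is assumed square and nonsingular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def lpc_coefficients(x, p):
    """Least-squares a_0..a_p for predicting each value from the p + 1
    values before it (real data; x is a 0-based list of x_1, ..., x_m)."""
    rows = [[x[n - k] for k in range(p + 1)] for n in range(p, len(x) - 1)]  # rows of G
    targets = [x[n + 1] for n in range(p, len(x) - 1)]                        # entries of b
    GtG = [[sum(r[i] * r[j] for r in rows) for j in range(p + 1)]
           for i in range(p + 1)]
    Gtb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p + 1)]
    return solve_linear_system(GtG, Gtb)

def predict_next(x, a):
    """a_0 * x_m + a_1 * x_{m-1} + ... + a_p * x_{m-p}."""
    return sum(ak * x[-1 - k] for k, ak in enumerate(a))

# Hypothetical data obeying x_{n+1} = 1.5 x_n - 0.7 x_{n-1} exactly.
x = [1.0, 0.5]
for _ in range(10):
    x.append(1.5 * x[-1] - 0.7 * x[-2])
a = lpc_coefficients(x, p=1)
print(a, predict_next(x, a))
```

Because the test data satisfy a two-term recursion exactly, the recovered coefficients should be close to 1.5 and -0.7; with noisy data the fit would only minimize the squared prediction error.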

This method of prediction forms the basis of a data-compression tech- 
nique known as linear predictive coding (LPC). In many applications a 
long sequence of numbers has a certain amount of local redundancy, and 
many of the values can be well predicted from a small number of previous 
ones, using the method just described. Instead of transmitting the entire 
sequence of numbers, only some of the numbers, along with the coefficients 
and occasional outliers, are sent. 

The entry in the $k$th row, $n$th column of the matrix $G^\dagger G$ is

$$(G^\dagger G)_{kn} = \sum_{j=1}^{m-p-1} x_{p+1-k+j}\,x_{p+1-n+j}.$$

If we view the data as values of a stationary random process, then the
quantity $\frac{1}{m-p-1}(G^\dagger G)_{kn}$ is an estimate of the autocorrelation value $r_x(n - k)$.
Similarly, the $k$th entry of the vector $G^\dagger b$ is

$$(G^\dagger b)_k = \sum_{j=1}^{m-p-1} x_{p+1-k+j}\,x_{p+1+j},$$

and $\frac{1}{m-p-1}(G^\dagger b)_k$ is an estimate of $r_x(-k)$, for $k = 1, \ldots, p + 1$. This brings us
to the problem of predicting the next value for a (possibly nonstationary)
random process.


12.20 Discrete Random Processes


The most common model used in signal processing is that of a sum of 
complex exponential functions plus noise. The noise is viewed as a sequence 
of random variables, and the signal components also may involve random 
parameters, such as random amplitudes and phase angles. Such models are 
best studied as particular cases of discrete random processes. 

A discrete random process is an infinite sequence $\{X_n\}_{n=-\infty}^{+\infty}$ in which
each $X_n$ is a complex-valued random variable. The autocorrelation function
associated with the random process is defined for all index values $m$ and $n$
by $r_x(m, n) = E(X_m\overline{X_n})$, where $E(\cdot)$ is the expectation or expected value
operator. For $m = n$ we get $r_x(n, n) = E(|X_n|^2)$. Generally, we have

$$\mathrm{variance}(X_n) = E(|X_n - E(X_n)|^2) = E(|X_n|^2) - |E(X_n)|^2.$$


12.20.1 Wide-Sense Stationary Processes 


We say that the random process is wide-sense stationary if $E(X_n)$
is independent of $n$ and $r_x(m, n)$ is a function only of the difference,
$m - n$. Since $E(X_n)$ does not depend on $n$, it is common to assume
that this constant mean has been subtracted, so that $E(X_n) = 0$. Then
$\mathrm{variance}(X_n) = E(|X_n|^2)$, which is independent of $n$ as well. For the
remainder of this chapter all random processes will be wide-sense stationary.

For wide-sense stationary processes the autocorrelation function
becomes $r_x(k) = E(X_{n+k}\overline{X_n})$, so that $r_x(0)$ is the constant variance of the
$X_n$. The power spectrum $R_x(\omega)$ of the random process is defined using the
values $r_x(k)$ as its Fourier coefficients:

$$R_x(\omega) = \sum_{k=-\infty}^{\infty} r_x(k)e^{ik\omega},$$

for all $\omega$ in the interval $[-\pi, \pi]$. It can be proved that the power spectrum
is a nonnegative function of the form $R_x(\omega) = |G(\omega)|^2$ and the
autocorrelation sequence $\{r_x(k)\}$ satisfies the equations

$$r_x(k) = \sum_{n=-\infty}^{+\infty} g_{k+n}\overline{g_n},$$

for

$$G(\omega) = \sum_{n=-\infty}^{+\infty} g_n e^{in\omega}.$$


In practice we will have actual values Xn = £n, for only finitely many of the 
Xn, say for n = 1,...,m. These can be used to estimate the values r,(k), at 
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least for values of k between, say, —M/5 and M/5. For example, we could 
estimate r,(k) by averaging all the products of the form £k}m£m that we 
can compute from the data. Clearly, as k gets farther away from zero we 
have fewer such products, so our average is a less accurate estimate. 

Once we have $r_x(k)$, $|k| \le N$, we form the $N+1$ by $N+1$ autocorrelation
matrix $R$ having the entries $R_{mn} = r_x(m-n)$. This autocorrelation matrix
is what is used in the design of optimal filters.

The matrix $R$ is Hermitian, that is, $R_{nm} = \overline{R_{mn}}$, so that $R^\dagger = R$. An
$M$ by $M$ Hermitian matrix $H$ is said to be nonnegative definite if, for all
complex column vectors $a = (a_1, \ldots, a_M)^T$, the quadratic form $a^\dagger H a$ is a
nonnegative number and positive definite if such a quadratic form is always
positive, when $a$ is not zero.
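To make these definitions concrete, the following sketch builds a small autocorrelation matrix and evaluates one quadratic form. The helper names and the sample values $r(k) = (0.5i)^k$, $k = 0, 1, 2$, are hypothetical; evaluating a single vector $a$ illustrates the definition but does not prove nonnegative definiteness.

```python
def autocorrelation_matrix(r, size):
    """Return the size-by-size matrix with entries R[m][n] = r(m - n).

    The list r holds r(0), r(1), ...; negative lags use r(-k) = conj(r(k)).
    """
    def lag(k):
        return complex(r[k]) if k >= 0 else complex(r[-k]).conjugate()
    return [[lag(m - n) for n in range(size)] for m in range(size)]


def quadratic_form(H, a):
    """Compute a^dagger H a for a square matrix H and a vector a."""
    size = len(a)
    Ha = [sum(H[m][n] * a[n] for n in range(size)) for m in range(size)]
    return sum(complex(a[m]).conjugate() * Ha[m] for m in range(size))


# Hypothetical autocorrelation values r(k) = (0.5i)^k for k = 0, 1, 2.
r = [1.0, 0.5j, -0.25]
R = autocorrelation_matrix(r, 3)
is_hermitian = all(
    abs(R[n][m] - R[m][n].conjugate()) < 1e-12 for m in range(3) for n in range(3)
)
a = [1.0, -1j, 2.0]
q = quadratic_form(R, a)  # real and nonnegative for this a
```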


Ex. 12.5 Show that the autocorrelation matrix $R$ is nonnegative definite.
Under what conditions can $R$ fail to be positive definite? Hint: Let

$$A(\omega) = \sum_{n=1}^{N+1} a_n e^{in\omega}$$

and express the integral

$$\int_{-\pi}^{\pi} |A(\omega)|^2 R_x(\omega)\, d\omega$$

in terms of the $a_n$ and the $R_{mn}$.


In Chapter 13 we shall consider the maximum entropy method for estimating
the power spectrum from finitely many values of $r_x(k)$.


12.20.2 Autoregressive Processes 


We noted previously that the case of a discrete-time signal with addi- 
tive random noise provides a good example of a discrete random process; 
there are others. One particularly important type is the autoregressive (AR) 
process, which is closely related to ordinary linear differential equations. 

When a smooth periodic function has noise added the new function 
is rough. Imagine, though, a fairly weighty pendulum of a clock, moving 
smoothly and periodically. Now imagine that a young child is throwing 
small stones at the bob of the pendulum. The movement of the pendulum 
is no longer periodic, but it is not rough. The pendulum is moving randomly 
in response to the random external disturbance, but not as if a random noise 
component has been added to its motion. To model such random processes 
we need to extend the notion of an ordinary differential equation. That 
leads us to the AR processes. 

Recall that an ordinary linear Mth order differential equation with con- 
stant coefficients has the form 


$$x^{(M)}(t) + c_1 x^{(M-1)}(t) + c_2 x^{(M-2)}(t) + \ldots + c_{M-1} x'(t) + c_M x(t) = f(t),$$


where $x^{(m)}(t)$ denotes the $m$th derivative of the function $x(t)$ and the $c_m$
are constants. In many applications the variable t is time and the function 
f(t) is an external effect driving the linear system, with system response 
given by the unknown function x(t). How the system responds to a variety 
of external drivers is of great interest. It is sometimes convenient to re- 
place this continuous formulation with a discrete analog called a difference 
equation. 

In switching from differential equations to difference equations, we
discretize the time variable and replace the driving function $f(t)$ with $f_n$,
$x(t)$ with $x_n$, the first derivative at time $t$, $x'(t)$, with the first
difference, $x_n - x_{n-1}$, the second derivative $x''(t)$ with the second difference,
$(x_n - x_{n-1}) - (x_{n-1} - x_{n-2})$, and so on. The differential equation is then
replaced by the difference equation


$$x_n - a_1 x_{n-1} - a_2 x_{n-2} - \ldots - a_M x_{n-M} = f_n \qquad (12.7)$$


for some constants $a_m$; the negative signs are a technical convenience only.

We now assume that the driving function is a discrete random process
$\{f_n\}$, so that the system response becomes a discrete random process,
$\{X_n\}$. If we assume that the driver $f_n$ is a mean-zero white noise process
that is independent of the $\{X_n\}$, then the process $\{X_n\}$ is called an
autoregressive (AR) process. What the system does at time $n$ depends partly
on what it has done at the $M$ discrete times prior to time $n$, as well as
partly on what the external disturbance $f_n$ is at time $n$. Our goal is usually
to determine the constants $a_m$; this is system identification. Our data
is typically some number of consecutive measurements of the $X_n$.

Multiplying both sides of Equation (12.7) by $\overline{X_{n-k}}$, for some $k > 0$ and
taking the expected value, we obtain


$$E(X_n \overline{X_{n-k}}) - \sum_{m=1}^{M} a_m E(X_{n-m}\overline{X_{n-k}}) = 0,$$


or

$$r_x(k) - a_1 r_x(k-1) - \ldots - a_M r_x(k-M) = 0.$$


Taking $k = 0$, we get

$$r_x(0) - a_1 r_x(-1) - \ldots - a_M r_x(-M) = E(|f_n|^2) = \mathrm{var}(f_n).$$


To find the $a_m$ we use the data to estimate $r_x(k)$ at least for $k = 0, 1, \ldots, M$.
Then, we use these estimates in the previous linear equations, solving them
for the $a_m$.
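As a rough illustration of this system-identification recipe, the sketch below simulates a real second-order AR process, estimates $r_x(0)$, $r_x(1)$, $r_x(2)$ by averaging, and solves the two linear equations for $a_1$ and $a_2$. Everything specific here is an assumption made for the example: the coefficients 0.5 and -0.3, the unit-variance Gaussian driver, the sample size, and the helper names.

```python
import random


def simulate_ar(a, n_samples, seed=0, burn_in=500):
    """Simulate real X_n = a[0] X_{n-1} + ... + a[M-1] X_{n-M} + f_n, f_n ~ N(0, 1)."""
    rng = random.Random(seed)
    M = len(a)
    x = [0.0] * M
    for _ in range(burn_in + n_samples):
        f = rng.gauss(0.0, 1.0)
        x.append(sum(a[m] * x[-1 - m] for m in range(M)) + f)
    return x[-n_samples:]


def sample_autocorrelation(x, k):
    """Average of x[n + k] * x[n] over all available n (real data, k >= 0)."""
    n = len(x) - k
    return sum(x[i + k] * x[i] for i in range(n)) / n


# Hypothetical second-order system; the coefficients are illustrative only.
true_a = [0.5, -0.3]
x = simulate_ar(true_a, n_samples=20000)
r0, r1, r2 = (sample_autocorrelation(x, k) for k in range(3))

# Equations for k = 1 and k = 2 in the real case, where r(-k) = r(k):
#   r(1) = a1 r(0) + a2 r(1)
#   r(2) = a1 r(1) + a2 r(0)
det = r0 * r0 - r1 * r1
a1 = (r1 * r0 - r1 * r2) / det
a2 = (r0 * r2 - r1 * r1) / det

# The k = 0 equation then gives an estimate of var(f_n).
noise_var = r0 - a1 * r1 - a2 * r2
```

With this many samples the estimates should land close to the values used in the simulation, though the exact numbers depend on the random seed.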


12.20.3 Linear Systems with Random Input 


In our discussion of discrete linear filters, also called time-invariant
linear systems, we noted that it is common to consider as the input to such
a system a discrete random process, $\{X_n\}$. The output is then another
random process $\{Y_n\}$ given by


$$Y_n = \sum_{m=-\infty}^{+\infty} g_m X_{n-m},$$

for each $n$.


Ex. 12.6 Show that if the input process is wide-sense stationary then so
is the output. Show that the power spectrum $R_y(\omega)$ of the output is


$$R_y(\omega) = |G(\omega)|^2 R_x(\omega).$$


12.21 Stochastic Prediction 


In time series analysis one studies stochastic prediction methods, in
which the numbers $x_n$ are viewed as values of a discrete random process
$\{X_n\}$. The prediction coefficients are determined by considering the
statistical description of how the random variable $X_{m+1}$ is related to the
previous $X_n$. The prediction of $X_{m+1}$ is a linear combination of the random
variables $X_n$, $n = 1, \ldots, m$,


$$\hat{X}_{m+1} = a_0 X_m + a_1 X_{m-1} + \ldots + a_{m-1} X_1,$$


with the coefficients determined using the orthogonality principle.
Consequently, the coefficients satisfy the system of linear equations


$$E(X_{m+1}\overline{X_k}) = a_0 E(X_m \overline{X_k}) + \ldots + a_{m-1} E(X_1 \overline{X_k}),$$


for $k = 1, 2, \ldots, m$. The expected values in these equations are the
autocorrelations associated with the random process.


12.21.1 Prediction for an Autoregressive Process 


Suppose that the random process $\{X_n\}$ is an $M$th order AR process,
so that

$$X_n - a_1 X_{n-1} - \ldots - a_M X_{n-M} = f_n,$$


where $\{f_n\}$ is a mean-zero white noise process, independent of the $\{X_n\}$.


Ex. 12.7 Use our earlier discussion of the relationship between the
autocorrelation values $r_x(k)$ and the coefficients $a_m$ to show that the best
linear predictor for the random variable $X_n$ in terms of the values of
$X_{n-1}, \ldots, X_{n-M}$ is


$$\hat{X}_n = a_1 X_{n-1} + \ldots + a_M X_{n-M}$$

and the mean-squared error is


$$E(|X_n - \hat{X}_n|^2) = \mathrm{var}(f_n).$$


In fact, it can be shown that, because the process is an $M$th order AR
process, this is the best linear predictor of $X_n$ in terms of the entire history
of the process.


Chapter 13 


Nonlinear Methods 


13.1 Chapter Summary
13.2 The Classical Methods
13.3 Modern Signal Processing and Entropy
13.4 Related Methods
13.5 Entropy Maximization
13.6 Estimating Nonnegative Functions
13.7 Philosophical Issues
13.8 The Autocorrelation Sequence {r(n)}
13.9 Minimum-Phase Vectors
13.10 Burg’s MEM
    13.10.1 The Minimum-Phase Property
    13.10.2 Solving Ra = δ Using Levinson’s Algorithm
13.11 A Sufficient Condition for Positive-Definiteness
13.12 The IPDFT
13.13 The Need for Prior Information in Nonlinear Estimation
13.14 What Wiener Filtering Suggests
13.15 Using a Prior Estimate
13.16 Properties of the IPDFT
13.17 Illustrations
13.18 Fourier Series and Analytic Functions
    13.18.1 An Example
    13.18.2 Hyperfunctions
13.19 Fejér–Riesz Factorization
13.20 Burg Entropy
13.21 Some Eigenvector Methods
13.22 The Sinusoids-in-Noise Model
13.23 Autocorrelation
13.24 Determining the Frequencies
13.25 The Case of Non-White Noise




13.1 Chapter Summary 


It is common to speak of classical, as opposed to modern, signal process- 
ing methods. In this chapter we describe briefly the distinction. Then we 
discuss entropy maximization, eigenvector methods, and related nonlinear 
methods in signal processing. We first encounter infinite series expansions 
for functions in calculus when we study Maclaurin and Taylor series. Fourier 
series are usually first met in different contexts, such as partial differential 
equations and boundary value problems. Laurent expansions come later 
when we study functions of a complex variable. There are, nevertheless, 
important connections among these different types of infinite series expan- 
sions that we consider in this chapter. 


13.2 The Classical Methods 


In [48] Candy locates the beginning of the classical period of spectral 
estimation in Schuster’s use of Fourier techniques in 1898 to analyze sun- 
spot data [138]. The role of Fourier techniques grew with the discovery, by 
Wiener in the USA and Khintchine in the USSR, of the relation between the 
power spectrum and the autocorrelation function. Much of Wiener’s impor- 
tant work on control and communication remained classified and became 
known only with the publication of his classic text Time Series in 1949 
[162]. The book by Blackman and Tukey, Measurement of Power Spectra 
[10], provides perhaps the best description of the classical methods. With 
the discovery of the FFT by Cooley and Tukey in 1965, all the pieces were 
in place for the rapid development of this DFT-based approach to spectral 
estimation. 


13.3 Modern Signal Processing and Entropy 


Until about the middle of the 1970s most signal processing depended al- 
most exclusively on the DFT, as implemented using the FFT. Algorithms 
such as the Gerchberg-Papoulis bandlimited extrapolation method were 
performed as iterative operations on finite vectors, using the FFT at every
step. Linear filters and related windowing methods involving the FFT
were also used to enhance the resolution of the reconstructed objects. The
proper design of these filters was an area of interest to quite a number of 
researchers, John Tukey among them. Then, around the end of that decade, 
interest in entropy maximization began to grow, as researchers began to 
wonder if high-resolution methods developed for seismic oil exploration 
could be applied successfully in other areas. 

John Burg had developed his maximum entropy method (MEM) while 
working in the oil industry in the 1960s. He then went to Stanford as a 
mature graduate student and received his doctorate in 1975 for a thesis 
based largely on his earlier work on MEM [21]. This thesis and a handful 
of earlier presentations at meetings [19, 20] fueled the interest in entropy. 

It was not only the effectiveness of Burg’s techniques that attracted 
the attention of members of the signal-processing community. The classical 
methods seemed to some to be ad hoc, and they sought a more intellectu- 
ally satisfying basis for spectral estimation. Classical methods start with 
the time series data, say $x_n$, for $n = 1, \ldots, N$. In the direct approach, slightly
simplified, the data is windowed; that is, $x_n$ is replaced with $x_n w_n$ for some
choice of constants $w_n$. Then, the vDFT is computed, using the FFT, and
the squared magnitudes of the entries of the vDFT provide the desired
estimate of the power spectrum. In the more indirect approach, autocorrelation
values $r_x(m)$ are first estimated, for $m = 0, 1, \ldots, M$, where $M$ is
some fraction of the data length $N$. Then, these estimates of $r_x(m)$ are
windowed and the vDFT calculated, again using the FFT. 
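The direct approach can be sketched as follows. This is a minimal illustration, not the text's own implementation: the sign convention in the exponent, the Hann-type window, the test signal (a single complex sinusoid on bin 5), and the name `windowed_periodogram` are all assumptions, and a plain $O(N^2)$ sum stands in for the FFT.

```python
import cmath
import math


def windowed_periodogram(x, w):
    """Squared magnitudes of the DFT of the windowed data x[n] * w[n]."""
    N = len(x)
    y = [x[n] * w[n] for n in range(N)]
    spectrum = []
    for k in range(N):
        # Direct O(N^2) DFT; an FFT would return the same numbers faster.
        total = sum(y[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spectrum.append(abs(total) ** 2)
    return spectrum


N = 64
# Hypothetical data: one complex sinusoid that falls exactly on DFT bin 5.
x = [cmath.exp(2j * math.pi * 5 * n / N) for n in range(N)]
rect = [1.0] * N                                                   # no windowing
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

p_rect = windowed_periodogram(x, rect)
p_hann = windowed_periodogram(x, hann)
peak_rect = max(range(N), key=lambda k: p_rect[k])
peak_hann = max(range(N), key=lambda k: p_hann[k])
```

Both estimates peak at the same bin, but the windowed one spreads energy into the neighboring bins while the unwindowed one does not for this particular signal; the answer depends on the window chosen.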

What some people objected to was the use of these windows. After 
all, the measured data was $x_n$, not $x_n w_n$, so why corrupt the data at the
first step? The classical methods produced answers that depended to some 
extent on which window function one used; there had to be a better way. 
Entropy maximization was the answer to their prayers. 

In 1981 the first of several international workshops on entropy maxi- 
mization was held at the University of Wyoming, bringing together most 
of the people working in this area. The books [145] and [146] contain the 
papers presented at those workshops. As one can see from reading those 
papers, the general theme is that a new day has dawned. 


13.4 Related Methods 


It was soon recognized that maximum entropy methods were closely 
related to model-based techniques that had been part of statistical time 
series for decades. This realization led to a broader use of autoregressive 
(AR) and autoregressive, moving average (ARMA) models for spectral estimation
[129], as well as of eigenvector methods, such as Pisarenko’s method
[126]. What Candy describes as the modern approach to spectral estima- 
tion is one based on explicit parametric models, in contrast to the classical 
non-parametric approach. The book edited by Don Childers [53] is a col- 
lection of journal articles that captures the state-of-the-art at the end of 
the 1970s. 

In a sense the transition from the classical ways to the modern methods 
solved little; the choice of models is as ad hoc as the choice of windows 
was before. On the other hand, we do have a wider collection of techniques 
from which to choose and we can examine these techniques to see when they 
perform well and when they do not. We do not expect one approach to work 
in all cases. High-speed computation permits the use of more complicated 
parametric models tailored to the physics of a given situation. 

Our estimates are intended to be used for some purpose. In medical 
imaging a doctor is going to make a diagnosis based in part on what the 
image reveals. How good the image needs to be depends on the purpose 
for which it is made. Judging the quality of a reconstructed image based 
on somewhat subjective criteria, such as how useful it is to a doctor, is 
a problem that is not yet solved. Human-observer studies are one way to 
obtain this nonmathematical evaluation of reconstruction and estimation 
methods. The next step beyond that is to develop computer software that 
judges the images or spectra as a human would. 


13.5 Entropy Maximization 


The problem of estimating the nonnegative function $R(\omega)$, for $|\omega| \le \pi$,
from the finitely many Fourier coefficients


$$r(n) = \int_{-\pi}^{\pi} R(\omega) \exp(-in\omega)\, d\omega/2\pi, \quad n = -N, \ldots, N,$$


is an under-determined problem, meaning that the data alone is insufficient 
to determine a unique answer. In such situations we must select one solution 
out of the infinitely many that are mathematically possible. The obvious 
questions we need to answer are: What criteria do we use in this selection? 
How do we find algorithms that meet our chosen criteria? In this chapter 
we look at some of the answers people have offered and at one particular 
algorithm, Burg’s maximum entropy method (MEM) [19, 20].


13.6 Estimating Nonnegative Functions


The values $r(n)$ are autocorrelation-function values associated with a
random process having $R(\omega)$ for its power spectrum. In many applications,
such as seismic remote sensing, these autocorrelation values are estimates
obtained from relatively few samples of the underlying random process, so
that $N$ is not large. The DFT estimate,


$$R_{DFT}(\omega) = \sum_{n=-N}^{N} r(n) \exp(in\omega),$$


is real-valued and consistent with the data, but is not necessarily nonnega- 
tive. For small values of N, the DFT may not be sufficiently resolving to be 
useful. This suggests that one criterion we can use to perform our selection 
process is to require that the method provide better resolution than the 
DFT for relatively small values of N, when reconstructing power spectra 
that consist mainly of delta functions. 
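A small numerical example shows how the DFT estimate can go negative. The sketch assumes a hypothetical true spectrum consisting of a single delta function at $\omega_0 = 1$, scaled so that $r(n) = \exp(-in\omega_0)$, and takes $N = 5$; the function name `dft_estimate` and the grid size are our own choices.

```python
import cmath
import math


def dft_estimate(r, omega):
    """R_DFT(omega) = sum_{n=-N}^{N} r(n) exp(i n omega), with r(-n) = conj(r(n))."""
    N = len(r) - 1
    total = complex(r[0])
    for n in range(1, N + 1):
        term = r[n] * cmath.exp(1j * n * omega)
        total += term + term.conjugate()  # adds the n and -n terms together
    return total.real


N = 5
omega0 = 1.0                                   # hypothetical spike location
r = [cmath.exp(-1j * n * omega0) for n in range(N + 1)]

grid = [-math.pi + 2 * math.pi * j / 400 for j in range(400)]
values = [dft_estimate(r, w) for w in grid]
peak_value = dft_estimate(r, omega0)           # equals 2N + 1 at the true location
lowest_value = min(values)                     # negative in the sidelobes
```

The estimate reproduces the data exactly, yet its sidelobes dip below zero, so it cannot itself be a power spectrum.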


13.7 Philosophical Issues 


Generally speaking, we would expect to do a better job of estimating a 
function from data pertaining to that function if we also possess additional 
prior information about the function to be estimated and are able to em- 
ploy estimation techniques that make use of that additional information. 
There is the danger, however, that we may end up with an answer that 
is influenced more by our prior guesses than by the actual measured data. 
Striking a balance between including prior knowledge and letting the data 
speak for itself is a noble goal; how to achieve that is the question. At this 
stage, we begin to suspect that the problem is as much philosophical as it 
is mathematical. 

We are essentially looking for principles of induction that enable us to 
extrapolate from what we have measured to what we have not. Unwill- 
ing to turn the problem over entirely to the philosophers, a number of 
mathematicians and physicists have sought mathematical solutions to this 
inference problem, framed in terms of what the most likely answer is, or 
which answer involves the smallest amount of additional prior information 
[60]. This is not, of course, a new issue; it has been argued for centuries 
with regard to the use of what we now call Bayesian statistics; objective 
Bayesians allow the use of prior information, but only if it is the “right”
prior information. The interested reader should consult the books [145] and
[146], containing papers by Ed Jaynes, Roy Frieden, and others originally 
presented at workshops on this topic held in the early 1980s. 

The maximum entropy method is a general approach to such problems 
that includes Burg’s algorithm as a particular case. It is argued that by 
maximizing entropy we are, in some sense, being maximally noncommittal 
about what we do not know and thereby introducing a minimum of prior 
knowledge (some would say prior guesswork) into the solution. In the case 
of Burg’s MEM, a somewhat more mathematical argument is available. 

Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary random process with autocorrelation
sequence $r(m)$ and power spectrum $R(\omega)$, $|\omega| \le \pi$. The prediction problem
is the following: Suppose we have measured the values, at “times” prior to
$n$, of one realization of the process and we want to predict the value of the
process at time $n$. On average, how much error do we expect to make in
predicting $X_n$ from knowledge of the infinite past? The answer, according
to Szegö’s Theorem [93], is


$$\exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log R(\omega)\, d\omega \right);$$


the integral


$$\int_{-\pi}^{\pi} \log R(\omega)\, d\omega$$


is the Burg entropy of the random process [129]. Processes that are very
predictable have low entropy, while those that are quite unpredictable,
or, like white noise, completely unpredictable, have high entropy; to make
entropies comparable, we assume a fixed value of $r(0)$. Given the data $r(n)$,
$|n| \le N$, Burg’s method selects that power spectrum consistent with these
autocorrelation values that corresponds to the most unpredictable random
process.
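The prediction-error formula can be checked numerically on a case where the answer is known. The sketch below assumes a hypothetical first-order AR process with real coefficient $a = 0.8$ and driving-noise variance 2, whose power spectrum is $\sigma^2/|1 - a e^{i\omega}|^2$, and compares it with white noise having the same $r(0)$. The Riemann sum and all names are our own choices.

```python
import cmath
import math


def prediction_error(R, n_grid=2000):
    """Approximate exp((1 / 2 pi) * integral of log R(omega)) by a Riemann sum."""
    step = 2 * math.pi / n_grid
    mean_log = sum(math.log(R(-math.pi + j * step)) for j in range(n_grid)) / n_grid
    return math.exp(mean_log)


sigma2 = 2.0  # hypothetical variance of the driving white noise
a = 0.8       # hypothetical AR(1) coefficient, |a| < 1


def ar1_spectrum(omega):
    return sigma2 / abs(1 - a * cmath.exp(1j * omega)) ** 2


def white_spectrum(omega):
    # White noise with the same r(0) = sigma2 / (1 - a^2) as the AR(1) process.
    return sigma2 / (1 - a * a)


err_ar1 = prediction_error(ar1_spectrum)      # close to sigma2
err_white = prediction_error(white_spectrum)  # equals r(0)
```

For the AR(1) process the expected error is the variance of the driver, while for white noise with the same $r(0)$ it is all of $r(0)$; the less predictable process has the larger Burg entropy.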

Other similar procedures are also based on selection through optimiza- 
tion. We have seen the minimum norm approach to finding a solution to an 
underdetermined system of linear equations, and the minimum expected 
squared error approach in statistical filtering, and later we shall see the 
maximum likelihood method used in detection. We must keep in mind 
that, however comforting it may be to feel that we are on solid philosophi- 
cal ground (if such exists) in choosing our selection criteria, if the method 
does not work well, we must use something else. As we shall see, the MEM, 
like every other reasonable method, works well sometimes and not so well 
other times. There is certainly philosophical precedent for considering the 
consequences of our choices, as Blaise Pascal’s famous wager about the ex- 
istence of God nicely illustrates. As an attentive reader of the books [145] 
and [146] will surely note, there is a certain theological tone to some of 
the arguments offered in support of entropy maximization. One group of
authors (reference omitted) went so far as to declare that entropy maximization
was what one did if one cared what happened to one’s data.

The objective of Burg’s MEM for estimating a power spectrum is to 
seek better resolution by combining nonnegativity and data-consistency in 
a single closed-form estimate. The MEM is remarkable in that it is the only 
closed-form (that is, noniterative) estimation method that is guaranteed 
to produce an estimate that is both nonnegative and consistent with the 
autocorrelation samples. Later we shall consider a more general method, 
the inverse PDFT (IPDFT), that is both data-consistent and positive in 
most cases. 


13.8 The Autocorrelation Sequence {r(n)} 


We begin our discussion with important properties of the sequence
$\{r(n)\}$. Because $R(\omega) \ge 0$, the values $r(n)$ are often called autocorrelation
values.

Since $R(\omega) \ge 0$, it follows immediately that $r(0) \ge 0$. In addition,
$r(0) \ge |r(n)|$ for all $n$:


$$|r(n)| = \Big| \int_{-\pi}^{\pi} R(\omega) \exp(-in\omega)\, d\omega/2\pi \Big| \le \int_{-\pi}^{\pi} R(\omega) |\exp(-in\omega)|\, d\omega/2\pi = r(0).$$


In fact, if $r(0) = |r(n)| > 0$ for some $n > 0$, then $R$ is a sum of at most
$n + 1$ delta functions with nonnegative amplitudes. To see this, suppose
that $r(n) = |r(n)| \exp(i\theta) = r(0) \exp(i\theta)$. Then,


$$\begin{aligned}
\int_{-\pi}^{\pi} R(\omega) |1 - \exp(i(\theta + n\omega))|^2\, d\omega/2\pi
&= \int_{-\pi}^{\pi} R(\omega) (1 - \exp(i(\theta + n\omega)))(1 - \exp(-i(\theta + n\omega)))\, d\omega/2\pi \\
&= \int_{-\pi}^{\pi} R(\omega) [2 - \exp(i(\theta + n\omega)) - \exp(-i(\theta + n\omega))]\, d\omega/2\pi \\
&= 2r(0) - \exp(i\theta)\overline{r(n)} - \exp(-i\theta) r(n) = 2r(0) - r(0) - r(0) = 0.
\end{aligned}$$


Therefore, $R(\omega) > 0$ only at the values of $\omega$ where $|1 - \exp(i(\theta + n\omega))|^2 = 0$;
that is, only at $\omega = n^{-1}(2\pi k - \theta)$ for some integer $k$. Since $|\omega| \le \pi$, there
are only finitely many such $k$.

This result is important in any discussion of resolution limits. It is
natural to feel that if we have only the Fourier coefficients $r(n)$ for $|n| \le N$
then we have only the low frequency information about the function
$R(\omega)$. How is it possible to achieve higher resolution? Notice, however,
that in the case just considered, the infinite sequence of Fourier coefficients
is periodic. Of course, we do not know this a priori, necessarily. The fact
that $|r(N)| = r(0)$ does not, by itself, tell us that $R(\omega)$ consists solely of
delta functions and that the sequence of Fourier coefficients is periodic.
But, under the added assumption that $R(\omega) \ge 0$, it does! When we put
in this prior information about $R(\omega)$ we find that the data now tells us
more than it did before. This is a good example of the point made in the
Introduction; to get information out we need to put information in.
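The situation just described is easy to reproduce numerically. In the sketch below we place three delta functions at frequencies of the form $n^{-1}(2\pi k - \theta)$ and confirm that $|r(n)| = r(0)$. The lag $n = 4$, the phase $\theta = 0.6$, and the amplitudes are hypothetical values chosen for illustration. With a nonzero $\theta$ the coefficients satisfy $r(m + n) = \exp(i\theta) r(m)$, so they repeat up to a fixed phase factor and are exactly periodic when $\theta = 0$.

```python
import cmath
import math

n, theta = 4, 0.6                    # hypothetical lag and phase
amps = {0: 1.0, 1: 0.5, 2: 2.0}      # hypothetical amplitudes A_k, indexed by k
freqs = {k: (2 * math.pi * k - theta) / n for k in amps}


def r(m):
    """Fourier coefficient r(m) of the spectrum with mass 2*pi*A_k at omega_k."""
    return sum(A * cmath.exp(-1j * m * freqs[k]) for k, A in amps.items())


r0 = r(0).real
ratio = abs(r(n)) / r0               # equals 1: the extreme case |r(n)| = r(0)

# The whole sequence is determined by n consecutive values.
shifted = [r(m + n) for m in range(-3, 4)]
rotated = [cmath.exp(1j * theta) * r(m) for m in range(-3, 4)]
```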

In discussing the Burg MEM estimate, we shall need to refer to the 
concept of minimum-phase vectors. We consider that briefly now. 


13.9 Minimum-Phase Vectors 


We say that the finite column vector with complex entries
$(a_0, a_1, \ldots, a_N)^T$ is a minimum-phase vector if the complex polynomial


$$A(z) = a_0 + a_1 z + \ldots + a_N z^N$$


has the property that $A(z) = 0$ implies that $|z| > 1$; that is, all roots of
$A(z)$ are outside the unit circle. Consequently, the function $B(z)$ given by
$B(z) = 1/A(z)$ is analytic in a disk centered at the origin and including
the unit circle. Therefore, we can write

$$B(z) = b_0 + b_1 z + b_2 z^2 + \ldots,$$

and taking $z = \exp(i\omega)$, we get

$$B(\exp(i\omega)) = b_0 + b_1 \exp(i\omega) + b_2 \exp(2i\omega) + \ldots.$$

The point here is that $B(\exp(i\omega))$ is a one-sided trigonometric series, with
only terms corresponding to $\exp(in\omega)$ for nonnegative $n$.
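The one-sided expansion can be computed by matching powers of $z$ in $A(z)B(z) = 1$. The sketch below does this for a hypothetical minimum-phase polynomial $A(z) = (1 - z/2)(1 + z/3)$, whose roots 2 and $-3$ lie outside the unit circle, and compares a partial sum with $1/A$ at one point on the unit circle. The function name and the evaluation point are our own choices.

```python
import cmath


def inverse_series(a, n_terms):
    """Coefficients b_0, ..., b_{n_terms - 1} of B(z) = 1/A(z), A(z) = sum a[k] z^k."""
    b = [1 / a[0]]
    for n in range(1, n_terms):
        # The coefficient of z^n in A(z) B(z) must vanish for n >= 1.
        acc = sum(a[k] * b[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        b.append(-acc / a[0])
    return b


# A(z) = (1 - z/2)(1 + z/3) = 1 - z/6 - z^2/6, with roots 2 and -3.
a = [1.0, -1.0 / 6.0, -1.0 / 6.0]
b = inverse_series(a, 60)

omega = 0.7                                    # arbitrary point on the unit circle
z = cmath.exp(1j * omega)
A_value = sum(a[k] * z ** k for k in range(len(a)))
B_partial = sum(b[n] * z ** n for n in range(len(b)))
gap = abs(B_partial - 1 / A_value)             # tiny, since the b_n decay geometrically
```

If a root of $A(z)$ were inside the unit circle, the same recursion would produce coefficients that grow, and the partial sums would not converge on the unit circle.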


13.10 Burg’s MEM 


The approach is to estimate $R(\omega)$ by the function $S(\omega) > 0$ that maximizes
the so-called Burg entropy, $\int_{-\pi}^{\pi} \log S(\omega)\, d\omega$, subject to the data
constraints.

The Euler-Lagrange equation from the calculus of variations allows us
to conclude that $S(\omega)$ has the form


$$S(\omega) = 1/H(\omega)$$

for

$$H(\omega) = \sum_{n=-N}^{N} h_n e^{in\omega} > 0.$$


From the Fejér–Riesz Theorem 13.2 we know that $H(\omega) = |A(e^{i\omega})|^2$ for
minimum phase $A(z)$. As we now show, the coefficients $a_n$ satisfy a system
of linear equations formed using the data $r(n)$.

Given the data $r(n)$, $|n| \le N$, we form the autocorrelation matrix $R$
with entries $R_{mn} = r(m-n)$, for $0 \le m, n \le N$. Let $\delta$ be the column
vector $\delta = (1, 0, \ldots, 0)^T$. Let $a = (a_0, a_1, \ldots, a_N)^T$ be the solution of the system
$Ra = \delta$. Then, Burg’s MEM estimate is the function $S(\omega) = R_{MEM}(\omega)$
given by

$$R_{MEM}(\omega) = a_0/|A(\exp(i\omega))|^2, \quad |\omega| \le \pi.$$


Once we show that $a_0 > 0$, it will be obvious that $R_{MEM}(\omega) > 0$. We also
must show that $R_{MEM}$ is data-consistent; that is,


$$r(n) = \int_{-\pi}^{\pi} R_{MEM}(\omega) \exp(-in\omega)\, d\omega/2\pi, \quad n = -N, \ldots, N.$$


Let us write $R_{MEM}(\omega)$ as a Fourier series; that is,


$$R_{MEM}(\omega) = \sum_{n=-\infty}^{+\infty} q(n) \exp(in\omega), \quad |\omega| \le \pi.$$


From the form of $R_{MEM}(\omega)$, and with $B(z) = 1/A(z)$ as in Section 13.9, we have

$$R_{MEM}(\omega)\, \overline{A(\exp(i\omega))} = a_0 B(\exp(i\omega)). \qquad (13.1)$$


Suppose, as we shall see shortly, that $A(z)$ has all its roots outside the
unit circle, so $B(\exp(i\omega))$ is a one-sided trigonometric series, with only
terms corresponding to $\exp(in\omega)$ for nonnegative $n$. Then, multiplying on
the left side of Equation (13.1), and equating coefficients corresponding to
$n = 0, -1, -2, \ldots$, we find that, provided $q(n) = r(n)$, for $|n| \le N$, we must
have $Ra = \delta$. Notice that these are precisely the same equations we solve
in calculating the coefficients of an AR process. For that reason the MEM
is sometimes called an autoregressive method for spectral estimation.


13.10.1 The Minimum-Phase Property


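We now show that if $Ra = \delta$ then $A(z)$ has all its roots outside the unit
circle. Let $r \exp(i\theta)$ be a root of $A(z)$. Then, write


$$A(z) = (z - r\exp(i\theta)) C(z),$$


where

$$C(z) = c_0 + c_1 z + c_2 z^2 + \ldots + c_{N-1} z^{N-1}.$$


The vector $a = (a_0, a_1, \ldots, a_N)^T$ can be written as $a = -r\exp(i\theta) c + d$,
where $c = (c_0, c_1, \ldots, c_{N-1}, 0)^T$ and $d = (0, c_0, c_1, \ldots, c_{N-1})^T$. So, $\delta = Ra =
-r\exp(i\theta) Rc + Rd$ and


$$0 = d^\dagger \delta = -r\exp(i\theta)\, d^\dagger Rc + d^\dagger Rd,$$


so that

$$r\exp(i\theta)\, d^\dagger Rc = d^\dagger Rd.$$


From the Cauchy Inequality we know that

$$|d^\dagger Rc|^2 \le (d^\dagger Rd)(c^\dagger Rc) = (d^\dagger Rd)^2, \qquad (13.2)$$


where the last equality comes from the special form of the matrix $R$ and
the similarity between $c$ and $d$.
With

$$D(\omega) = c_0 e^{i\omega} + c_1 e^{2i\omega} + \ldots + c_{N-1} e^{iN\omega}$$

and

$$C(\omega) = c_0 + c_1 e^{i\omega} + \ldots + c_{N-1} e^{i(N-1)\omega},$$


we can easily show that


$$d^\dagger Rd = c^\dagger Rc = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega) |D(\omega)|^2\, d\omega$$


and

$$d^\dagger Rc = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega) \overline{D(\omega)} C(\omega)\, d\omega.$$


If there is equality in the Cauchy Inequality (13.2), then $r = 1$ and we
would have


$$\exp(i\theta) \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega) \overline{D(\omega)} C(\omega)\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega) |D(\omega)|^2\, d\omega.$$


From the Cauchy Inequality for integrals, we can conclude that


$$\exp(i\theta) \overline{D(\omega)} C(\omega) = |D(\omega)|^2$$

for all $\omega$ for which $R(\omega) > 0$. But,

$$\exp(i\omega) C(\omega) = D(\omega).$$


Therefore, we cannot have $r = 1$ unless $R(\omega)$ consists of a single delta
function; that is, $R(\omega) = \delta(\omega - \theta)$. In all other cases we have


$$|d^\dagger Rc|^2 < |r|^2 |d^\dagger Rc|^2,$$


from which we conclude that $|r| > 1$.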


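13.10.2 Solving Ra = δ Using Levinson’s Algorithm


Because the matrix $R$ is Toeplitz, that is, constant on diagonals, and
positive definite, there is a fast algorithm for solving $Ra = \delta$ for $a$. Instead
of a single $R$, we let $R_M$ be the matrix defined for $M = 0, 1, \ldots, N$ by


$$R_M = \begin{pmatrix} r(0) & r(-1) & \ldots & r(-M) \\ r(1) & r(0) & \ldots & r(-M+1) \\ \vdots & \vdots & & \vdots \\ r(M) & r(M-1) & \ldots & r(0) \end{pmatrix},$$

so that $R = R_N$. We also let $\delta^M$ be the $(M+1)$-dimensional column
vector $\delta^M = (1, 0, \ldots, 0)^T$. We want to find the column vector $a^M =
(a_0^M, a_1^M, \ldots, a_M^M)^T$ that satisfies the equation $R_M a^M = \delta^M$. The point of
Levinson’s algorithm is to calculate $a^{M+1}$ quickly from $a^M$.

For fixed $M$ find constants $\alpha$ and $\beta$ so that


$$\delta^M = R_M \left\{ \alpha \begin{pmatrix} a_0^{M-1} \\ a_1^{M-1} \\ \vdots \\ a_{M-1}^{M-1} \\ 0 \end{pmatrix} + \beta \begin{pmatrix} 0 \\ \overline{a_{M-1}^{M-1}} \\ \vdots \\ \overline{a_1^{M-1}} \\ \overline{a_0^{M-1}} \end{pmatrix} \right\} = \alpha \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ \gamma^M \end{pmatrix} + \beta \begin{pmatrix} \overline{\gamma^M} \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix},$$


where

$$\gamma^M = r(M) a_0^{M-1} + r(M-1) a_1^{M-1} + \ldots + r(1) a_{M-1}^{M-1}.$$

We then have

$$\alpha + \beta \overline{\gamma^M} = 1, \quad \alpha \gamma^M + \beta = 0,$$

or

$$\beta = -\alpha \gamma^M, \quad \alpha - \alpha |\gamma^M|^2 = 1,$$

so

$$\alpha = 1/(1 - |\gamma^M|^2), \quad \beta = -\gamma^M/(1 - |\gamma^M|^2).$$


Therefore, the algorithm begins with $M = 0$, $R_0 = [r(0)]$, $a_0^0 = r(0)^{-1}$. At
each step calculate the $\gamma^M$, solve for $\alpha$ and $\beta$ and form the next $a^M$.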
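The recursion above translates almost line for line into code. The following is a minimal sketch rather than a tuned implementation: the function names are ours, the list `r` is assumed to hold $r(0), \ldots, r(N)$ with $r(-k) = \overline{r(k)}$, and the test values (a sum of two geometrically decaying sequences, which gives a positive-definite Toeplitz matrix) are hypothetical.

```python
import cmath


def levinson(r):
    """Solve R a = delta for the Hermitian Toeplitz R with entries r(m - n)."""
    a = [1 / r[0]]                                      # a^0 = r(0)^{-1}
    for M in range(1, len(r)):
        gamma = sum(r[M - j] * a[j] for j in range(M))
        alpha = 1 / (1 - abs(gamma) ** 2)
        beta = -gamma * alpha
        u = a + [0]                                     # (a^{M-1}, 0)
        v = [0] + [c.conjugate() for c in reversed(a)]  # (0, reversed conjugates)
        a = [alpha * u[j] + beta * v[j] for j in range(M + 1)]
    return a


def toeplitz_times(r, a):
    """Multiply the Toeplitz matrix R (entries r(m - n)) by the vector a."""
    def lag(k):
        return r[k] if k >= 0 else r[-k].conjugate()
    size = len(a)
    return [sum(lag(m - n) * a[n] for n in range(size)) for m in range(size)]


# Hypothetical autocorrelation values r(0), ..., r(4).
r = [(0.6 * cmath.exp(0.9j)) ** k + 0.5 * (0.3 * cmath.exp(-2j)) ** k for k in range(5)]
a = levinson(r)

residual = toeplitz_times(r, a)
residual[0] -= 1                        # subtract delta = (1, 0, ..., 0)
max_error = max(abs(e) for e in residual)
```

Each step costs on the order of $M$ operations, so the whole solution takes on the order of $N^2$ operations instead of the $N^3$ needed by general elimination.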

The MEM resolves better than the DFT when the true power spectrum 
being reconstructed is a sum of delta functions plus a flat background. 
When the background itself is not flat, performance of the MEM degrades 
rapidly; the MEM tends to interpret any nonflat background in terms of 
additional delta functions. In Section 13.12 we consider an extension of
the MEM, called the indirect PDFT (IPDFT), that corrects this flaw.

Why Burg’s MEM and the IPDFT are able to resolve closely spaced 
sinusoidal components better than the DFT is best answered by studying 
the eigenvalues and eigenvectors of the matrix R; we turn to this topic in 
Chapter 14. 
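The resolution claim can be tried out on a small synthetic case. The sketch below assumes exact autocorrelation values for two unit-amplitude sinusoids at the hypothetical frequencies 1.0 and 1.3 plus white noise of power 0.01, with $N = 10$; a plain Gaussian elimination routine stands in for Levinson's algorithm, and all names are our own.

```python
import cmath
import math


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (stand-in for Levinson's algorithm)."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0j] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


N, sigma2 = 10, 0.01     # hypothetical data length and noise power
w1, w2 = 1.0, 1.3        # hypothetical spike locations


def r(n):
    """Autocorrelation of two unit sinusoids plus white noise of power sigma2."""
    value = cmath.exp(-1j * n * w1) + cmath.exp(-1j * n * w2)
    return value + sigma2 if n == 0 else value


R = [[r(m - n) for n in range(N + 1)] for m in range(N + 1)]
a = solve(R, [1.0] + [0.0] * N)


def mem(omega):
    A = sum(a[n] * cmath.exp(1j * n * omega) for n in range(N + 1))
    return a[0].real / abs(A) ** 2


def dft(omega):
    return sum(r(n) * cmath.exp(1j * n * omega) for n in range(-N, N + 1)).real


mid = 0.5 * (w1 + w2)
mem_dip = mem(mid) / mem(w1)   # far below 1: MEM shows two separate peaks
dft_dip = dft(mid) / dft(w1)   # above 1: the DFT shows a single merged bump
```

In this high-SNR, flat-background case the MEM estimate drops sharply between the two frequencies while the DFT estimate is larger at the midpoint than at either spike.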


13.11 A Sufficient Condition for Positive-Definiteness 
If the function


$$R(\omega) = \sum_{n=-\infty}^{\infty} r(n) e^{in\omega}$$

is nonnegative on the interval $[-\pi, \pi]$, then the matrices $R_M$ are nonnegative
definite for every $M$. Theorems by Herglotz and by Bochner go in the
reverse direction [2]. Katznelson [99] gives the following result.

Theorem 13.1 Let $\{f(n)\}_{n=-\infty}^{\infty}$ be a sequence of nonnegative real numbers
converging to zero, with $f(-n) = f(n)$ for each $n$. If, for each $n > 0$,
we have


$$(f(n-1) - f(n)) - (f(n) - f(n+1)) > 0,$$


then there is a nonnegative function $R(\omega)$ on the interval $[-\pi, \pi]$ with


$f(n) = r(n)$ for each $n$.
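As a sanity check on the theorem, consider the hypothetical sequence $f(n) = 2^{-|n|}$. The sketch below confirms the second-difference condition for the first forty values of $n$ and evaluates a truncated version of the series for $R(\omega)$ on a grid. The truncation length and grid size are our own choices, and a grid check of this kind illustrates the conclusion without proving it.

```python
import math


def f(n):
    """Hypothetical even sequence f(n) = 2^(-|n|), which decreases to zero."""
    return 0.5 ** abs(n)


# Second differences (f(n-1) - f(n)) - (f(n) - f(n+1)) for n = 1, ..., 40.
second_differences = [(f(n - 1) - f(n)) - (f(n) - f(n + 1)) for n in range(1, 41)]


def R(omega, n_terms=60):
    """Truncated series sum_{|n| <= n_terms} f(n) exp(i n omega); real since f is even."""
    return f(0) + 2 * sum(f(n) * math.cos(n * omega) for n in range(1, n_terms + 1))


grid = [-math.pi + 2 * math.pi * j / 720 for j in range(721)]
smallest = min(R(w) for w in grid)   # attained at omega = -pi and pi, where R = 1/3
```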


The following figures illustrate the behavior of the MEM. In Figures 13.1,
13.2, and 13.3, the true object has two delta functions at $0.95\pi$ and $1.05\pi$.
The data is $f(n)$ for $|n| \le 10$. The DFT cannot resolve the two spikes. The
SNR is high in Figure 13.1, and the MEM easily resolves them. In Figure
13.2 the SNR is much lower and MEM no longer resolves the spikes.


Ex. 13.1 In Figure 13.3 the SNR is much higher than in Figure 13.1.
Explain why the graph looks as it does.


In Figure 13.4 the true object is a box supported between $0.75\pi$ and
$1.25\pi$. Here $N = 10$, again. The MEM does a poor job reconstructing the
box. This weakness in MEM will become a problem in the last two figures,
in which the true object consists of the box with the two spikes added. In
Figure 13.5 we have $N = 10$, while, in Figure 13.6, $N = 25$.


FIGURE 13.1: The DFT and MEM, N = 10, high SNR.


FIGURE 13.2: The DFT and MEM, N = 10, low SNR.


13.12 The IPDFT 


Experience with Burg’s MEM shows that it is capable of resolving 
closely spaced delta functions better than the DFT, provided that the back- 
ground is flat. When the background is not flat, MEM tends to interpret 
the non-flat background as additional delta functions to be resolved. In this 
chapter we consider an extension of MEM based on the PDFT that can 
resolve in the presence of non-flat background. This method is called the 
indirect PDFT (IPDFT) [26].


FIGURE 13.3: The DFT and MEM, N = 10, very high SNR. What
happened?


13.13 The Need for Prior Information in Nonlinear Estimation


As we saw previously, the PDFT is a linear method for incorporating 
prior knowledge into the estimation of the Fourier transform. Burg’s MEM 
is a nonlinear method for estimating a non-negative Fourier transform. 

The IPDFT applies to the reconstruction of one-dimensional power
spectra, but the main idea can be used to generate high-resolution methods
for multi-dimensional spectra as well. The IPDFT method is suggested by
considering the MEM equations $Ra = \delta$ as a particular case of the equations
that arise in Wiener filter approximation. As in the previous chapter,
we assume that we have the autocorrelation values $r(n)$ for $|n| \le N$, from
which we wish to estimate the power spectrum


$$R(\omega) = \sum_{n=-\infty}^{+\infty} r(n) e^{in\omega}, \quad |\omega| \le \pi.$$


FIGURE 13.4: MEM and DFT for a box object; N = 10.


13.14 What Wiener Filtering Suggests 


In Chapter 20 on Wiener filter approximation, we show that the best
finite length filter approximation of the Wiener filter $H(\omega)$ is obtained by
minimizing the integral in Equation (20.3)

$$\int_{-\pi}^{\pi} \Big| H(\omega) - \sum_{k=-K}^{L} f_k e^{ik\omega} \Big|^2 \big(R_s(\omega) + R_u(\omega)\big)\, d\omega.$$


FIGURE 13.5: The DFT and MEM: two spikes on a large box; N = 10.


The optimal coefficients then must satisfy Equation (20.4):

$$r_s(m) = \sum_{k=-K}^{L} f_k \big(r_s(m-k) + r_u(m-k)\big), \qquad (13.3)$$

for $-K \le m \le L$.

Consider the case in which the power spectrum we wish to estimate 
consists of a signal component that is the sum of delta functions and a 
noise component that is white noise. If we construct a finite-length Wiener 
filter that filters out the signal component and leaves only the noise, then 
that filter should be able to zero out the delta-function components. By 
finding the locations of those zeros, we can find the supports of the delta 
functions. So the approach is to reverse the roles of signal and noise, viewing 
the signal as the component called u and the noise as the component called 
s in the discussion of the Wiener filter. The autocorrelation function $r_s(n)$
corresponds to the white noise now and so $r_s(n) = 0$ for $n \ne 0$. The terms
$r_s(n) + r_u(n)$ are the data values $r(n)$, for $|n| \le N$. Taking $K = 0$ and
$L = N$ in Equation (13.3), we obtain

$$\sum_{k=0}^{N} f_k r(m-k) = 0,$$

for $m = 1, 2, \ldots, N$ and

$$\sum_{k=0}^{N} f_k r(0-k) = r(0),$$

which is precisely the same system $Ra = \delta$ that occurs in MEM.


FIGURE 13.6: The DFT and MEM: two spikes on a large box; N = 25.

This approach reveals that the vector $a = (a_0, \ldots, a_N)^T$ we find in MEM
can be viewed as a finite-length approximation of the Wiener filter designed
to remove the delta-function component and to leave the remaining flat
white-noise component untouched. The polynomial

$$A(\omega) = \sum_{n=0}^{N} a_n e^{in\omega}$$

will then have zeros near the supports of the delta functions. What happens
to MEM when the background is not flat is that the filter tries to eliminate
any component that is not white noise and so places the zeros of $A(\omega)$ in
the wrong places.


13.15 Using a Prior Estimate 


Suppose we take $P(\omega) \ge 0$ to be our estimate of the background
component of $R(\omega)$; that is, we believe that $R(\omega)$ equals a multiple of $P(\omega)$
plus a sum of delta functions. We now ask for the finite length
approximation of the Wiener filter that removes the delta functions and leaves
any background component that looks like $P(\omega)$ untouched. We then take
$r_s(n) = p(n)$, where

$$P(\omega) = \sum_{n=-\infty}^{+\infty} p(n) e^{in\omega}, \quad |\omega| \le \pi.$$


The desired filter is $f = (f_0, \ldots, f_N)^T$ satisfying the equations

$$p(m) = \sum_{k=0}^{N} f_k r(m-k). \qquad (13.4)$$

Once we have found $f$ we form the polynomial

$$F(\omega) = \sum_{k=0}^{N} f_k e^{ik\omega}, \quad |\omega| \le \pi.$$

The zeros of $F(\omega)$ should then be near the supports of the delta function
components of the power spectrum $R(\omega)$, provided that our original
estimate of the background is not too inaccurate.
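
The following is a minimal numerical sketch of Equation (13.4), not the author's implementation. It assumes a flat prior, so that $p(0) = 1$ and $p(m) = 0$ otherwise (the MEM special case), and synthetic data $r(n) = e^{-in\omega_0} + \sigma^2\delta_{n0}$, that is, one delta function at $\omega_0$ on a white background. The names `solve_linear`, `ipdft_filter`, `grid`, and the particular values of `N`, `sigma2`, and `omega0` are illustrative assumptions. The code solves the Toeplitz system by plain Gaussian elimination and locates the deepest dip of $|F(\omega)|$ on a grid.

```python
import cmath
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for c in range(col, n + 1):
                M[row][c] -= factor * M[col][c]
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ipdft_filter(r, p, N):
    """Solve sum_k f_k r(m-k) = p(m) for m = 0..N, as in Equation (13.4)."""
    R = [[r(m - k) for k in range(N + 1)] for m in range(N + 1)]
    return solve_linear(R, [p(m) for m in range(N + 1)])

def F(f, omega):
    """Evaluate the trigonometric polynomial F(omega) = sum_k f_k e^{ik omega}."""
    return sum(fk * cmath.exp(1j * k * omega) for k, fk in enumerate(f))

# Illustrative data (assumed): one delta at omega0 plus white background.
N = 10
sigma2 = 0.1
grid = [-math.pi + 2 * math.pi * j / 512 for j in range(512)]
omega0 = grid[300]

def r(n):
    return cmath.exp(-1j * n * omega0) + (sigma2 if n == 0 else 0.0)

def p(n):
    # Flat prior P(omega) = 1, so p is the delta sequence.
    return 1.0 if n == 0 else 0.0

f = ipdft_filter(r, p, N)
dip = min(range(len(grid)), key=lambda j: abs(F(f, grid[j])))
print("deepest dip of |F| at omega =", grid[dip], "true omega0 =", omega0)
```

With a non-flat prior one would replace `p` by the Fourier coefficients of the chosen $P(\omega)$; the solver and the search for dips of $|F|$ stay the same.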

In the PDFT it is important to select the prior estimate $P(\omega)$ nonzero
wherever the function being reconstructed is nonzero; for the IPDFT the
situation is different. Comparing Equation (13.4) with Equation (2.23), we
see that in the IPDFT the true $R(\omega)$ is playing the role previously given to
$P(\omega)$, while $P(\omega)$ is in the role previously played by the function we wished
to estimate, which, in the IPDFT, is $R(\omega)$. It is important, therefore, that
$R(\omega)$ not be zero where $P(\omega) \ne 0$; that is, we should choose $P(\omega) = 0$
wherever $R(\omega) = 0$. Of course, we usually do not know the support of $R(\omega)$
a priori. The point is simply that it is better to make $P(\omega) = 0$ than to
make it nonzero, if we have any doubt as to the value of $R(\omega)$.


13.16 Properties of the IPDFT


In our discussion of the MEM, we obtained an estimate for the function
$R(\omega)$, not simply a way of locating the delta-function components. As we
shall show, the IPDFT can also be used to estimate $R(\omega)$. Although the
resulting estimate is not guaranteed to be nonnegative and data consistent,
it usually is both of these.

For any function $G(\omega)$ on $[-\pi, \pi]$ with Fourier series

$$G(\omega) = \sum_{n=-\infty}^{\infty} g_n e^{in\omega},$$

the additive causal part of the function $G(\omega)$ is

$$G_+(\omega) = \sum_{n=0}^{\infty} g_n e^{in\omega}.$$

Any function such as $G_+$ that has Fourier coefficients that are zero for
negative indices is called a causal function. Equation (13.4) then says
that the two causal functions $P_+$ and $(FR)_+$ have Fourier coefficients that
agree for $m = 0, 1, \ldots, N$.

Because $F(\omega)$ is a finite causal trigonometric polynomial, we can write

$$(FR)_+(\omega) = R_+(\omega)F(\omega) + J(\omega),$$

where

$$J(\omega) = \sum_{m=0}^{N-1} \Big( \sum_{k=1}^{N-m} r(-k) f_{m+k} \Big) e^{im\omega}.$$


Treating $P_+$ as approximately equal to $(FR)_+ = R_+F + J$, we obtain as an
estimate of $R_+$ the function $Q = (P_+ - J)/F$. In order for this estimate of
$R_+$ to be causal, it is sufficient that the function $1/F$ be causal. This means
that the trigonometric polynomial $F(\omega)$ must be minimum phase; that is,
all its roots lie outside the unit circle. In our discussion of the MEM, we
saw that this is always the case for MEM. It is not always the case for the
IPDFT, but it is usually the case in practice; in fact, it was difficult (but
possible) to construct a counterexample. We then construct our IPDFT
estimate of $R(\omega)$, which is

$$R_{IPDFT}(\omega) = 2\,\mathrm{Re}(Q(\omega)) - r(0).$$

The IPDFT estimate is real-valued and, when $1/F$ is causal, guaranteed
to be data consistent. Although this estimate is not guaranteed to be
nonnegative, it usually is.

We showed previously that the vector $a$ that solves $Ra = \delta$ corresponds
to a polynomial $A(z)$ having all its roots on or outside the unit circle; that is,
it is minimum phase. The IPDFT involves the solution of the system $Rf = p$,
where $p = (p(0), \ldots, p(N))^T$ is the vector of initial Fourier coefficients of
another power spectrum, $P(\omega) \ge 0$ on $[-\pi, \pi]$. When $P(\omega)$ is constant, we
get $p = \delta$. For the IPDFT to be data-consistent, it is sufficient that the
polynomial $F(z) = f_0 + \ldots + f_N z^N$ be minimum phase. Although this need
not be the case, it is usually observed in practice.


Ex. 13.2 Find conditions on the power spectra R(w) and P(w) that cause 
F(z) to be minimum phase. Warning: I have not solved this, so it is probably 
not an easy exercise. 


13.17 Illustrations 


The figures below illustrate the IPDFT. The prior function in each case 
is the box object supported on the central fourth of the interval $[0, 2\pi]$. The
value r(0) has been increased slightly to regularize the matrix inversion.
Figure 13.7 shows the behavior of the IPDFT when the object is only the
box. Contrast this with the behavior of MEM in this case, as seen in Figure
13.4. Figures 13.8 and 13.9 show the ability of the IPDFT to resolve the two
spikes at $0.95\pi$ and $1.05\pi$ against the box background. Again, contrast this
with the MEM reconstructions in Figures 13.5 and 13.6. To show that the 
IPDFT is actually indicating the presence of the spikes and not just rolling 
across the top of the box, we reconstruct two unequal spikes in Figure 
13.10. Figure 13.11 shows how the IPDFT behaves when we increase the 
number of data points; now, N = 25 and the SNR is very low. 


13.18 Fourier Series and Analytic Functions 


Suppose that $f(z)$ is analytic in an annulus containing the unit circle
$C = \{z \mid |z| = 1\}$. Then $f(z)$ has a Laurent series expansion

$$f(z) = \sum_{n=-\infty}^{\infty} f_n z^n$$

valid for $z$ within that annulus. Substituting $z = e^{i\theta}$, we get $f(e^{i\theta})$, also
written as $b(\theta)$, defined for $\theta$ in the interval $[-\pi, \pi]$ by

$$b(\theta) = f(e^{i\theta}) = \sum_{n=-\infty}^{\infty} f_n e^{in\theta};$$

here the Fourier series for $b(\theta)$ is derived from the Laurent series for the
analytic function $f(z)$. If $f(z)$ is actually analytic in $(1 + \epsilon)D$, where
$D = \{z \mid |z| < 1\}$ is the open unit disk, then $f(z)$ has a Taylor series
expansion and the Fourier series for $b(\theta)$ contains only terms corresponding
to nonnegative $n$.


FIGURE 13.7: The DFT and IPDFT: box only, N = 1.


13.18.1 An Example 


As an example, consider the rational function 


$$f(z) = \frac{1}{(z - \frac{1}{2})(z - 3)}. \qquad (13.5)$$


FIGURE 13.8: The DFT and IPDFT, box and two spikes, N = 10, high
SNR.


In an annulus containing the unit circle this function has the Laurent series
expansion

$$f(z) = -\frac{2}{5} \Big[ \sum_{n=-\infty}^{-1} 2^{n+1} z^n + \sum_{n=0}^{\infty} \Big(\frac{1}{3}\Big)^{n+1} z^n \Big];$$

replacing $z$ with $e^{i\theta}$, we obtain the Fourier series for the function
$b(\theta) = f(e^{i\theta})$ defined for $\theta$ in the interval $[-\pi, \pi]$.

The function $F(z) = 1/f(z)$ is analytic for all complex $z$, but because
it has a root inside the unit circle, its reciprocal, $f(z)$, is not analytic in
a disk containing the unit circle. Consequently, the Fourier series for $b(\theta)$
is doubly infinite. We saw in the chapter on complex variables that the
function $G(z) = \frac{z - a}{1 - \bar{a}z}$ has $|G(e^{i\theta})| = 1$. With $a = 2$ and
$H(z) = F(z)G(z)$, we have

$$H(z) = -\frac{1}{2}(z - 3)(z - 2),$$

and its reciprocal has the form

$$1/H(z) = \sum_{n=0}^{\infty} a_n z^n.$$

Because

$$G(e^{i\theta})/H(e^{i\theta}) = 1/F(e^{i\theta}),$$

it follows that

$$|1/H(e^{i\theta})| = |1/F(e^{i\theta})| = |b(\theta)|$$

and so

$$|b(\theta)| = \Big| \sum_{n=0}^{\infty} a_n e^{in\theta} \Big|.$$


FIGURE 13.9: The DFT and IPDFT, box and two spikes, N = 10,
moderate SNR.


Multiplication by $G(z)$ permits us to move a root from inside $C$ to outside
$C$ without altering the magnitude of the function's values on $C$.


FIGURE 13.10: The DFT and IPDFT, box and unequal spikes, N = 10,
high SNR.
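
The example of Section 13.18.1 can be checked numerically. The sketch below assumes $f(z) = 1/((z - \frac{1}{2})(z - 3))$, which is consistent with the roots $z = 0.5$ and $z = 3$ named in the text, together with the Laurent coefficients obtained from partial fractions, $G(z) = (z - 2)/(1 - 2z)$, and $H(z) = -\frac{1}{2}(z - 3)(z - 2)$. These explicit forms are reconstructions worked out here, not quotations, and the function names are illustrative.

```python
import cmath
import math

def f(z):
    return 1.0 / ((z - 0.5) * (z - 3.0))

def laurent_partial_sum(z, terms=60):
    """Truncated Laurent series of f in the annulus 1/2 < |z| < 3."""
    negative = sum(2.0 ** (n + 1) * z ** n for n in range(-terms, 0))
    nonnegative = sum((1.0 / 3.0) ** (n + 1) * z ** n for n in range(terms))
    return -0.4 * (negative + nonnegative)

def G(z, a=2.0):
    # a is real here, so conj(a) = a.
    return (z - a) / (1.0 - a * z)

def H(z):
    # H = F * G with F(z) = (z - 1/2)(z - 3): the root 1/2 has moved to 2.
    return -0.5 * (z - 3.0) * (z - 2.0)

points = [cmath.exp(1j * (-math.pi + 2 * math.pi * j / 16)) for j in range(16)]
series_error = max(abs(f(z) - laurent_partial_sum(z)) for z in points)
modulus_error = max(abs(abs(G(z)) - 1.0) for z in points)
magnitude_gap = max(abs(abs(1.0 / H(z)) - abs(f(z))) for z in points)
print(series_error, modulus_error, magnitude_gap)
```

All three printed quantities should be at the level of rounding error: the two-sided series reproduces $f$ on the unit circle, $G$ has modulus one there, and $1/H$ has the same magnitude as $f$ even though its expansion is one-sided.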


The relationships between functions defined on C and functions an- 
alytic (or harmonic) in D form the core of harmonic analysis [93]. The 
factorization F(z) = H(z)/G(z) above is a special case of the inner-outer 
factorization for functions in Hardy spaces; the function H(z) is an outer 
function, and the functions G(z) and 1/G(z) are inner functions. 


13.18.2 Hyperfunctions 


The rational function f(z) given by Equation (13.5) is analytic in an 
annulus containing the unit circle in its interior. The annulus has width 
equal to 2.5, the distance between the roots z = 0.5 and z = 3. Within that 
annulus the function has a convergent Laurent expansion, and by setting
$z = e^{i\theta}$ we get the Fourier series for the function $b(\theta)$ on $[-\pi, \pi]$. But not
every function that has a convergent Fourier series is the restriction to the 
unit circle of a function analytic in an annulus containing the unit circle. 


FIGURE 13.11: The DFT and IPDFT, box and unequal spikes, N = 25,
very low SNR.


To extend the notion that Fourier series are related to Laurent series, we
have to entertain the possibility of the width of the annulus shrinking to
zero. This leads to the theory of hyperfunctions, introduced in 1958 by the 
Japanese mathematician Mikio Sato [135] (see also [125]). To get a sense 
of what is involved without going far into details, we consider the Fourier 
series for the Dirac delta function. 

The Fourier series for $\delta(x)$ is

$$\delta(x) = \sum_{n=-\infty}^{\infty} e^{inx}.$$

Replacing $e^{ix}$ with $z$, we get the Laurent series

$$-1 + \sum_{n=0}^{\infty} z^{-n} + \sum_{n=0}^{\infty} z^{n}.$$

The first sum converges for $|z| > 1$ and

$$\sum_{n=0}^{\infty} z^{-n} = \frac{1}{1 - z^{-1}} = \frac{z}{z - 1}.$$

The second sum converges for $|z| < 1$ and is

$$\sum_{n=0}^{\infty} z^{n} = \frac{1}{1 - z}.$$

The sum of these two functions is

$$\frac{z}{z - 1} + \frac{1}{1 - z} = \frac{z - 1}{z - 1} = 1,$$

so that

$$-1 + \frac{z}{z - 1} + \frac{1}{1 - z} = 0$$

for $z \ne 1$. For $z = 1$, so that $x = 0$, the Laurent series sums to $+\infty$. We
can see, therefore, that, in some sense, the Fourier series for $\delta(x)$ can be
understood in terms of a Laurent series, but that the associated annulus of 
the Laurent series has zero width; there is no actual annulus within which 
both halves of the series converge. What we have, instead, are two functions, 
one analytic on the inside of the unit circle, and the other analytic on the 
outside. Sato’s idea is to consider as a single object, a hyperfunction, the 
pair of functions so defined. 


13.19 Fejér—Riesz Factorization 


Sometimes we start with an analytic function and restrict it to the unit 
circle. Other times we start with a function $f(e^{i\theta})$ defined on the unit circle,
or, equivalently, a function of the form $b(\theta)$ for $\theta$ in $[-\pi, \pi]$, and view this
function as the restriction to the unit circle of a function that is analytic 
in a region containing the unit circle. One application of this idea is the 
Fejér—Riesz factorization theorem: 


Theorem 13.2 Let $h(e^{i\theta})$ be a finite trigonometric polynomial

$$h(e^{i\theta}) = \sum_{n=-N}^{N} h_n e^{in\theta},$$

such that $h(e^{i\theta}) \ge 0$ for all $\theta$ in the interval $[-\pi, \pi]$. Then there is

$$y(z) = \sum_{n=0}^{N} y_n z^n$$

with $h(e^{i\theta}) = |y(e^{i\theta})|^2$. The function $y(z)$ is unique if we require, in
addition, that all its roots be outside $D$.


To prove this theorem we consider the function 


$$h(z) = \sum_{n=-N}^{N} h_n z^n,$$


which is analytic in an annulus containing the unit circle. The rest of the 
proof is contained in the following exercise. 


Ex. 13.3 Use the fact that $h_{-n} = \overline{h_n}$ to show that $z_j$ is a root of $h(z)$ if
and only if $1/\overline{z_j}$ is also a root. From the nonnegativity of $h(e^{i\theta})$, conclude
that if $h(z)$ has a root on the unit circle then it has even multiplicity. Take
$y(z)$ to be proportional to the product of factors $z - z_j$ for all the $z_j$ outside
$D$; for roots on $C$, include them with half their multiplicities.
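
As a small worked instance of Theorem 13.2, consider the case $N = 1$ with real coefficients, $h(e^{i\theta}) = h_0 + 2h_1\cos\theta$, which is nonnegative exactly when $h_0 \ge 2|h_1|$. The sketch below is only this special case, not the general construction of Ex. 13.3; the function name `fejer_riesz_degree_one` is an illustrative assumption. It solves $y_0^2 + y_1^2 = h_0$ and $y_0 y_1 = h_1$ in closed form, choosing the solution with $y_0 \ge |y_1|$ so that the root $-y_0/y_1$ of $y(z)$ does not lie inside the unit disk.

```python
import cmath
import math

def fejer_riesz_degree_one(h0, h1):
    """Return real (y0, y1) with h0 + 2*h1*cos(t) = |y0 + y1*e^{it}|^2.

    Assumes h0 > 0, h1 real, and h0 >= 2*|h1| (nonnegativity of h).
    """
    disc = math.sqrt(h0 * h0 - 4.0 * h1 * h1)
    y0 = math.sqrt((h0 + disc) / 2.0)   # larger coefficient, so |root| >= 1
    y1 = h1 / y0
    return y0, y1

h0, h1 = 5.0, 2.0                      # h(theta) = 5 + 4 cos(theta)
y0, y1 = fejer_riesz_degree_one(h0, h1)
max_gap = max(
    abs(h0 + 2.0 * h1 * math.cos(t) - abs(y0 + y1 * cmath.exp(1j * t)) ** 2)
    for t in [2.0 * math.pi * j / 64 for j in range(64)]
)
print(y0, y1, max_gap)
```

For this choice of $h$ the factor is $y(z) = 2 + z$, whose only root is $z = -2$.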


13.20 Burg Entropy 


The Fejér—Riesz theorem is used in the derivation of Burg’s maximum 
entropy method for spectrum estimation. The problem there is to estimate 
a function $R(\theta) \ge 0$ knowing only the values

$$r_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\theta) e^{-in\theta}\, d\theta,$$

for $|n| \le N$. The approach is to estimate $R(\theta)$ by the function $S(\theta) > 0$
that maximizes the so-called Burg entropy, $\int_{-\pi}^{\pi} \log S(\theta)\, d\theta$, subject to the
data constraints.

The Euler-Lagrange Equation from the calculus of variations allows us
to conclude that $S(\theta)$ has the form

$$S(\theta) = 1 \Big/ \sum_{n=-N}^{N} h_n e^{in\theta}.$$

The function

$$h(\theta) = \sum_{n=-N}^{N} h_n e^{in\theta}$$

is nonnegative, so, by the Fejér-Riesz theorem, it factors as $h(\theta) = |y(\theta)|^2$.
We then have $S(\theta)\overline{y(\theta)} = 1/y(\theta)$. Since all the roots of $y(z)$ lie outside $D$
and none are on $C$, the function $1/y(z)$ is analytic in a region containing
$C$ and $D$ so it has a Taylor series expansion in that region. Restricting this
Taylor series to $C$, we obtain a one-sided Fourier series having zero terms
for the negative indices.


Ex. 13.4 Show that the coefficients $y_n$ in $y(z)$ satisfy a system of linear
equations whose coefficients are the $r_n$. Hint: Compare the coefficients of
the terms on both sides of the equation $S(\theta)\overline{y(\theta)} = 1/y(\theta)$ that correspond
to negative indices.


13.21 Some Eigenvector Methods 


Prony’s method shows that information about the signal can sometimes 
be obtained from the roots of certain polynomials formed from the data. 
Eigenvector methods are similar, as we shall see now. 

Eigenvector methods assume the data are correlation values and in- 
volve polynomials formed from the eigenvectors of the correlation matrix. 
Schmidt’s multiple signal classification (MUSIC) algorithm is one such 
method [136]. A related technique used in direction-of-arrival array pro- 
cessing is the estimation of signal parameters by rotational invariance tech- 
niques (ESPRIT) of Paulraj, Roy, and Kailath [123]. 


13.22 The Sinusoids-in-Noise Model 


We suppose now that the function $f(t)$ being measured is signal plus
noise, with the form

$$f(t) = \sum_{j=1}^{J} |A_j| e^{i\theta_j} e^{-i\omega_j t} + n(t) = s(t) + n(t),$$

where the phases $\theta_j$ are random variables, independent and uniformly
distributed in the interval $[0, 2\pi)$, and $n(t)$ denotes the random complex
stationary noise component. Assume that $E(n(t)) = 0$ for all $t$ and that the
noise is independent of the signal components. We want to estimate $J$, the
number of sinusoidal components, their magnitudes $|A_j|$ and their
frequencies $\omega_j$.


13.23 Autocorrelation 


The autocorrelation function associated with $s(t)$ is

$$r_s(\tau) = \sum_{j=1}^{J} |A_j|^2 e^{-i\omega_j \tau},$$

and the signal power spectrum is the Fourier transform of $r_s(\tau)$,

$$R_s(\omega) = \sum_{j=1}^{J} |A_j|^2 \delta(\omega - \omega_j).$$

The noise autocorrelation is denoted $r_n(\tau)$ and the noise power spectrum
is denoted $R_n(\omega)$. For the remainder of this section we shall assume that
the noise is white noise; that is, $R_n(\omega)$ is constant and $r_n(\tau) = 0$ for $\tau \ne 0$.

We collect samples of the function $f(t)$ and use them to estimate some of
the values of $r_s(\tau)$. From these values of $r_s(\tau)$, we estimate $R_s(\omega)$, primarily
looking for the locations $\omega_j$ at which there are delta functions.

We assume that the samples of $f(t)$ have been taken over an interval
of time sufficiently long to take advantage of the independent nature of
the phase angles $\theta_j$ and the noise. This means that when we estimate the
$r_s(\tau)$ from products of the form $f(t + \tau)\overline{f(t)}$, the cross terms between one
signal component and another, as well as between a signal component and
the noise, are nearly zero, due to destructive interference coming from the
random phases.

Suppose now that we have the values $r_f(m)$ for $m = -(M-1), \ldots, M-1$,
where $M > J$, $r_f(m) = r_s(m)$ for $m \ne 0$, and $r_f(0) = r_s(0) + \sigma^2$, for $\sigma^2$
the variance (or power) of the noise. We form the $M$ by $M$ autocorrelation
matrix $R$ with entries $R_{mk} = r_f(m - k)$.


Ex. 13.5 Show that the matrix $R$ has the following form:

$$R = \sum_{j=1}^{J} |A_j|^2 e_j e_j^\dagger + \sigma^2 I,$$

where $e_j$ is the column vector with entries $e^{-in\omega_j}$, for $n = 0, 1, \ldots, M - 1$.

Let $u$ be an eigenvector of $R$ with $\|u\| = 1$ and associated eigenvalue $\lambda$.
Then we have

$$\lambda = u^\dagger R u = \sum_{j=1}^{J} |A_j|^2 |e_j^\dagger u|^2 + \sigma^2 \ge \sigma^2.$$

Therefore, the smallest eigenvalue of $R$ is $\sigma^2$.

Because $M > J$, there must be non-zero $M$-dimensional vectors $v$ that
are orthogonal to all of the $e_j$; in fact, we can say that there are $M - J$
linearly independent such $v$. For each such vector $v$ we have

$$Rv = \sum_{j=1}^{J} |A_j|^2 (e_j^\dagger v) e_j + \sigma^2 v = \sigma^2 v;$$

consequently, $v$ is an eigenvector of $R$ with associated eigenvalue $\sigma^2$.

Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_M > 0$ be the eigenvalues of $R$ and let $u^m$ be
a norm-one eigenvector associated with $\lambda_m$. It follows from the previous
paragraph that $\lambda_m = \sigma^2$, for $m = J + 1, \ldots, M$, while $\lambda_m > \sigma^2$ for
$m = 1, \ldots, J$. This leads to the MUSIC method for determining the $\omega_j$.


13.24 Determining the Frequencies 


By calculating the eigenvalues of $R$ and noting how many of them are
greater than the smallest one, we find $J$. Now we seek the $\omega_j$.
For each $\omega$, we let $e_\omega$ have the entries $e^{-in\omega}$, for $n = 0, 1, \ldots, M - 1$ and
form the function

$$T(\omega) = \sum_{m=J+1}^{M} |e_\omega^\dagger u^m|^2.$$

This function $T(\omega)$ will have zeros at precisely the values $\omega = \omega_j$, for
$j = 1, \ldots, J$. Once we have determined $J$ and the $\omega_j$, we estimate the magnitudes
$|A_j|$ using Fourier transform estimation techniques already discussed. This
is basically Schmidt's MUSIC method.
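
The smallest nontrivial case, $J = 1$ and $M = 2$, can be worked through in closed form, which avoids any eigenvalue routine. The sketch below assumes exact autocorrelation values $r_f(m) = |A_1|^2 e^{-im\omega_1} + \sigma^2\delta_{m0}$ and takes $|e_\omega^\dagger u|^2$ as the summand of $T(\omega)$; the function names and numerical values are illustrative assumptions. For a 2 by 2 Hermitian Toeplitz matrix with diagonal $a$ and off-diagonal entry $b$, the smallest eigenvalue is $a - |b|$ with eigenvector proportional to $(-b/|b|, 1)^T$.

```python
import cmath
import math

def autocorrelation_matrix(amplitude, omega1, sigma2, M=2):
    """R[m][k] = r_f(m - k) for one sinusoid in white noise."""
    def r_f(m):
        signal = amplitude ** 2 * cmath.exp(-1j * omega1 * m)
        return signal + (sigma2 if m == 0 else 0.0)
    return [[r_f(m - k) for k in range(M)] for m in range(M)]

def smallest_eigenpair_2x2(R):
    """Closed-form smallest eigenpair of a 2x2 Hermitian Toeplitz matrix."""
    a, b = R[0][0].real, R[0][1]
    lam = a - abs(b)
    v = [-(b / abs(b)) / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
    return lam, v

def music_T(omega, noise_vectors):
    """T(omega) = sum over noise eigenvectors u of |e_omega^dagger u|^2."""
    M = len(noise_vectors[0])
    e = [cmath.exp(-1j * n * omega) for n in range(M)]
    return sum(
        abs(sum(en.conjugate() * un for en, un in zip(e, u))) ** 2
        for u in noise_vectors
    )

omega1, sigma2 = 0.7, 0.25
R = autocorrelation_matrix(1.5, omega1, sigma2)
lam_min, u2 = smallest_eigenpair_2x2(R)
print(lam_min, music_T(omega1, [u2]), music_T(omega1 + 1.0, [u2]))
```

Here the smallest eigenvalue equals the noise power and $T$ vanishes at $\omega_1$. For larger $M$ and $J$ one would obtain the noise eigenvectors from a Hermitian eigenvalue solver instead of the closed form used here.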

We have made several assumptions here that may not hold in practice
and we must modify this eigenvector approach somewhat. First, the time
over which we are able to measure the function $f(t)$ may not be long enough
to give good estimates of the $r_f(\tau)$. In that case we may work directly with
the samples of $f(t)$. Second, the smallest eigenvalues will not be exactly
equal to $\sigma^2$ and some will be larger than others. If the $\omega_j$ are not well
separated, or if some of the $|A_j|$ are quite small, it may be hard to tell
what the value of $J$ is. Third, we often have measurements of $f(t)$ that
have errors other than those due to background noise; inexpensive sensors
can introduce their own random phases that can complicate the estimation
process. Finally, the noise may not be white, so that the estimated $r_f(\tau)$ will
not equal $r_s(\tau)$ for $\tau \ne 0$, as before. If we know the noise power spectrum
or have a decent idea what it is, we can perform a pre-whitening to $R$,
which will then return us to the case considered above, although this can
be a tricky procedure.


13.25 The Case of Non-White Noise 


When the noise power spectrum has a component that is not white 
the eigenvalues and eigenvectors of R behave somewhat differently from 
the white-noise case. The eigenvectors tend to separate into three groups. 
Those in the first group correspond to the smallest eigenvalues and are 
approximately orthogonal to both the signal components and the nonwhite 
noise component. Those in the second group, whose eigenvalues are some- 
what larger than those in the previous group, tend to be orthogonal to the 
signal components but to have a sizable projection onto the nonwhite-noise 
component. Those in the third group, with the largest eigenvalues, have siz- 
able projection onto both the signal and nonwhite noise components. Since 
the DFT estimate uses $R$, as opposed to $R^{-1}$, the DFT spectrum is
determined largely by the eigenvectors in the third group. The MEM estimator,
which uses $R^{-1}$, makes most use of the eigenvectors in the first group, but
in the formation of the denominator. In the presence of a nonwhite-noise 
component, the orthogonality of those eigenvectors to both the signals and 
the nonwhite noise shows up as peaks throughout the region of interest, 
masking or distorting the signal peaks we wish to see. 

There is a second problem exacerbated by the nonwhite component: 
sensitivity of nonlinear and eigenvector methods to phase errors. We have 
assumed up to now that the data we have obtained is accurate, but there 
isn’t enough of it. In some cases the machinery used to obtain the measured 
data may not be of the highest quality; certain applications of sonar make 
use of relatively inexpensive hydrophones that will sink into the ocean after 
they have been used briefly. In such cases the complex numbers r(n) will be 
distorted. Errors in the measurement of their phases are particularly dam- 
aging. Techniques for stabilizing high-resolution methods were presented in 
[28].


14.1 Chapter Summary


In Chapter 13 we considered the problem of estimating a nonnegative 
function of a continuous variable from finitely many of its Fourier coeffi- 
cients. The estimate was again a function of a continuous variable. In such 
cases, we would convert the estimate to a finite vector just prior to graph- 
ing the estimate. In this chapter we discuss an alternative approach, in 
which the nonnegative function to be estimated is discretized at the out- 
set. Discrete entropy maximization and related procedures are then used 
to reconstruct the nonnegative vector from finitely many linear-functional 
values. Unlike the MEM and the IPDFT methods, the algorithms we fo- 
cus on here, primarily the multiplicative algebraic reconstruction technique 
(MART) and its simultaneous version, the SMART, are iterative. 

We begin with the algebraic reconstruction technique (ART), which is 
not related to entropy maximization but which will help to motivate its 
multiplicative variant, the multiplicative algebraic reconstruction technique 
(MART). As we shall see, the MART is an iterative entropy-maximization 
method.


14.2 The Algebraic Reconstruction Technique


The ART, designed originally for the reconstruction of medical images 
in computerized tomography [84], is an iterative algorithm for finding a 
solution of a consistent system of linear equations, Ax = b, where A is an 
arbitrary I by J complex matrix. In the tomography case the vector x is a 
vectorization of a two- or three-dimensional discrete image, the vector b is 
the vector of measured data, and the matrix A describes the geometry of 
the sensing process. 

In the ART we begin by choosing an arbitrary starting vector, denoted
$x^0$. Having computed $x^k$, we calculate the next vector, $x^{k+1}$, using the
formula

$$x_j^{k+1} = x_j^k + \overline{A_{ij}} \big(b_i - (Ax^k)_i\big),$$

for $k = 0, 1, \ldots$, $i = k(\mathrm{mod}\ I) + 1$, and with the rows of $A$ normalized so
that $\sum_{j=1}^{J} |A_{ij}|^2 = 1$ for each $i$.

When $Ax = b$ has a solution the sequence $\{x^k\}$ converges to the solution
of the system closest to the starting vector, $x^0$; when $x^0 = 0$ the sequence
converges to the minimum-two-norm solution.
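
A minimal sketch of the ART iteration for a real matrix follows. It assumes real entries, so the conjugate plays no role, and it normalizes each row to unit length internally, since the update without a denominator presumes unit-length rows. The function name `art`, the sweep count, and the small test system are illustrative assumptions.

```python
import math

def art(A, b, x0, sweeps=200):
    """Cycle through the equations, projecting onto one hyperplane per step."""
    I, J = len(A), len(A[0])
    # Normalize each equation so that its row has Euclidean length one.
    norms = [math.sqrt(sum(a * a for a in row)) for row in A]
    A = [[a / n for a in row] for row, n in zip(A, norms)]
    b = [bi / n for bi, n in zip(b, norms)]
    x = list(x0)
    for k in range(sweeps * I):
        i = k % I                       # i = k (mod I), zero-based
        resid = b[i] - sum(A[i][j] * x[j] for j in range(J))
        x = [x[j] + A[i][j] * resid for j in range(J)]
    return x

A = [[1, 1, 0], [0, 1, 1]]
b = [2, 2]
x = art(A, b, [0.0, 0.0, 0.0])
print(x)
```

Starting from zero, the iterates for this underdetermined system approach $(2/3, 4/3, 2/3)$, which is the minimum-two-norm solution $A^T(AA^T)^{-1}b$.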


14.3 The Multiplicative Algebraic Reconstruction 
Technique 


The images to be reconstructed in transmission or emission tomogra- 
phy are necessarily nonnegative. The multiplicative algebraic reconstruc- 
tion technique, MART, is a variant of the ART that incorporates the prior 
information that the image to be reconstructed is nonnegative [84]. Like 
the ART, the MART can be used to solve more general systems of linear 
equations, although, for the MART, the matrix and vectors involved must 
be nonnegative. 

Let $P$ be an $I$ by $J$ matrix with nonnegative entries $P_{ij} \ge 0$, such that
$s_j = \sum_{i=1}^{I} P_{ij} > 0$, for $j = 1, \ldots, J$. Let $y$ be the $I$-dimensional vector with
entries $y_i > 0$, and suppose that the linear system of equations $y = Px$ has
a nonnegative solution $x$.

For the MART we begin with a positive vector $x^0$. Having computed
$x^k$, we calculate the next vector, $x^{k+1}$, using the formula

$$x_j^{k+1} = x_j^k \Big( \frac{y_i}{(Px^k)_i} \Big)^{m_i^{-1} P_{ij}}, \qquad (14.1)$$

where $m_i = \max\{P_{ij} \mid j = 1, \ldots, J\}$, and $i = k(\mathrm{mod}\ I) + 1$. When there is
a nonnegative solution for $y = Px$, the sequence $\{x^k\}$ converges to such a
solution. When there are multiple nonnegative solutions, we would like to
know which solution MART gives us; in particular, we want to know how
the solution depends on the starting vector $x^0$. The answer involves the
Kullback-Leibler, or cross-entropy, distance.
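
The iteration in Equation (14.1) can be sketched in a few lines. The function name `mart`, the sweep count, and the 2 by 3 test system below are illustrative assumptions. With 0/1 entries and a starting vector of ones, every iterate has the form $x = (c_1, c_1c_2, c_2)$, so the limit can be found by hand: it is $(\sqrt{3},\, 3 - \sqrt{3},\, \sqrt{3} - 1)$.

```python
import math

def mart(P, y, x0, sweeps=200):
    """Multiplicative ART: one equation per step, as in Equation (14.1)."""
    I, J = len(P), len(P[0])
    m = [max(row) for row in P]         # m_i = max_j P_ij
    x = list(x0)
    for k in range(sweeps * I):
        i = k % I                       # i = k (mod I), zero-based
        ratio = y[i] / sum(P[i][j] * x[j] for j in range(J))
        x = [x[j] * ratio ** (P[i][j] / m[i]) for j in range(J)]
    return x

P = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
y = [3.0, 2.0]
x = mart(P, y, [1.0, 1.0, 1.0])
print(x)
```

Entries with $P_{ij} = 0$ are left unchanged at each step, and positivity of the starting vector is preserved throughout.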


14.4 The Kullback—Leibler Distance 


For real numbers $a > 0$ and $b > 0$ we define the Kullback-Leibler, or
cross-entropy, distance from $a$ to $b$ to be

$$KL(a, b) = a \log\frac{a}{b} + b - a.$$

It follows from the inequality $\log t \le t - 1$, with equality if and only if
$t = 1$, that $KL(a, b) \ge 0$, and $KL(a, b) = 0$ if and only if $a = b$. We
also let $KL(0, b) = b$ and $KL(a, 0) = +\infty$. We extend the KL distance to
nonnegative vectors $x$ and $z$ by

$$KL(x, z) = \sum_{j=1}^{J} KL(x_j, z_j).$$


Since the function

$$f(t) = t - 1 - \log t$$

is convex, we have the following useful lemma:

Lemma 14.1 Let $a$ be a fixed positive number. If the set $\{KL(a, b) \mid b \in B\}$
is bounded, then so is $B$.

Let $x$ and $z$ be nonnegative vectors, with $x_+ = \sum_{j=1}^{J} x_j$. Then a simple
calculation shows that

$$KL(x_+, z_+) \le KL(x, z). \qquad (14.2)$$

When $x = u$, where $u_j = 1$ for all $j$, and $z$ is a probability vector, we have

$$KL(u, z) + J - 1 = -\sum_{j=1}^{J} \log z_j,$$

which is the negative of the Burg entropy of the probability vector $z$.
Similarly,

$$KL(z, u) - J + 1 = \sum_{j=1}^{J} z_j \log z_j,$$

which is the negative of the Shannon entropy of the probability vector $z$.

When the system $y = Px$ has a nonnegative solution the MART
sequence converges to the nonnegative solution $x$ for which the
Kullback-Leibler distance $KL(x, x^0)$ is minimized. Therefore, when we select $x^0 = u$,
the MART sequence converges to the solution of $y = Px$ maximizing the
Shannon entropy.
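
The definitions and the two entropy identities above are easy to check numerically. In this sketch the function names `kl` and `kl_vec` and the sample vectors are illustrative assumptions; the conventions $KL(0, b) = b$ and $KL(a, 0) = +\infty$ are those stated in the text.

```python
import math

def kl(a, b):
    """Kullback-Leibler distance between nonnegative numbers a and b."""
    if a == 0:
        return b
    if b == 0:
        return math.inf
    return a * math.log(a / b) + b - a

def kl_vec(x, z):
    """KL distance between nonnegative vectors, summed over the entries."""
    return sum(kl(a, b) for a, b in zip(x, z))

z = [0.5, 0.3, 0.2]                     # a probability vector
J = len(z)
u = [1.0] * J
burg_entropy = sum(math.log(t) for t in z)
shannon_entropy = -sum(t * math.log(t) for t in z)

print(kl_vec(u, z) + J - 1, -burg_entropy)
print(kl_vec(z, u) - J + 1, -shannon_entropy)
```

Each printed pair should agree to rounding error, and Equation (14.2) can be spot-checked the same way by comparing `kl(sum(x), sum(z))` with `kl_vec(x, z)`.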


14.5 The EMART 


We see from Equation (14.1) that the MART is computationally more 
complicated than the ART. The EMART [37] is an iterative method that, 
like the MART, applies to nonnegative systems of linear equations, while, 
like the ART, requires no exponentiation. 

Note that we can rewrite the right side of Equation (14.1) as a weighted 
geometric mean of two terms: 


$$x_j^{k+1} = \big(x_j^k\big)^{1 - m_i^{-1}P_{ij}} \Big( x_j^k \frac{y_i}{(Px^k)_i} \Big)^{m_i^{-1}P_{ij}}.$$

In the EMART we exchange the weighted geometric means for weighted
arithmetic means. The iterative step of the EMART is

$$x_j^{k+1} = \big(1 - m_i^{-1}P_{ij}\big) x_j^k + m_i^{-1}P_{ij} \Big( x_j^k \frac{y_i}{(Px^k)_i} \Big). \qquad (14.3)$$

When the system $y = Px$ has nonnegative solutions, the sequence $\{x^k\}$
generated by Equation (14.3) converges to a nonnegative solution. Unlike
the ART and MART, we do not know how the solution depends on the
starting vector $x^0$.
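
To see the relation between the two updates concretely, the sketch below performs a single MART step and a single EMART step from the same vector. The function names and the test data are illustrative assumptions. Because a weighted arithmetic mean is never smaller than the corresponding weighted geometric mean, each EMART entry is at least as large as the MART entry, with equality where $P_{ij} = m_i$ or $P_{ij} = 0$.

```python
def step_ratio(P, y, x, i):
    """y_i / (Px)_i for the equation used in this step."""
    return y[i] / sum(pij * xj for pij, xj in zip(P[i], x))

def mart_step(P, y, x, i):
    """One MART update, Equation (14.1): weighted geometric mean."""
    m_i = max(P[i])
    ratio = step_ratio(P, y, x, i)
    return [xj * ratio ** (pij / m_i) for pij, xj in zip(P[i], x)]

def emart_step(P, y, x, i):
    """One EMART update, Equation (14.3): weighted arithmetic mean."""
    m_i = max(P[i])
    ratio = step_ratio(P, y, x, i)
    return [(1.0 - pij / m_i) * xj + (pij / m_i) * xj * ratio
            for pij, xj in zip(P[i], x)]

P = [[1.0, 2.0, 0.5], [3.0, 1.0, 1.0]]
y = [4.0, 2.0]
x = [1.0, 1.0, 1.0]
geometric = mart_step(P, y, x, 0)
arithmetic = emart_step(P, y, x, 0)
print(geometric)
print(arithmetic)
```

The EMART step uses only multiplications and additions, which is the computational saving the text refers to.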


14.6 Simultaneous Versions 


All three of the methods discussed so far are sequential, in that only a
single equation is used at each step of the iteration. There are simultaneous
versions of these algorithms that use all of the equations at each step [41].
One reason for their use is that they also converge when the system of
equations is not consistent.


14.6.1 The Landweber Algorithm 


The simultaneous version of the ART is the Landweber algorithm [7]. 
The iterative step of the Landweber algorithm is the following: 


$$x^{k+1} = x^k + \gamma A^\dagger (b - Ax^k), \qquad (14.4)$$

with $0 < \gamma < \frac{2}{\rho(A^\dagger A)}$, where $\rho(A^\dagger A)$ is the spectral radius of $A^\dagger A$, which is its
largest eigenvalue, in this case. When $Ax = b$ has solutions the sequence
generated by Equation (14.4) converges to the solution closest to $x^0$ in the
two-norm. When $Ax = b$ has no solutions, the sequence converges to the
minimizer of $\|Ax - b\|_2$ for which $\|x - x^0\|_2$ is minimized.
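
A minimal sketch of Equation (14.4) for a real matrix follows. It assumes real entries, so that $A^\dagger = A^T$, and it chooses $\gamma$ as the reciprocal of the trace of $A^TA$; since the trace is at least as large as the spectral radius, this choice lies inside the admissible interval without requiring an eigenvalue computation. The function name `landweber` and the test systems are illustrative assumptions.

```python
def landweber(A, b, x0, iterations=500):
    """Landweber iteration x <- x + gamma * A^T (b - A x) for real A."""
    I, J = len(A), len(A[0])
    # trace(A^T A) = sum of squared entries >= rho(A^T A), so gamma is safe.
    gamma = 1.0 / sum(a * a for row in A for a in row)
    x = list(x0)
    for _ in range(iterations):
        resid = [b[i] - sum(A[i][j] * x[j] for j in range(J)) for i in range(I)]
        x = [x[j] + gamma * sum(A[i][j] * resid[i] for i in range(I))
             for j in range(J)]
    return x

# Consistent, underdetermined system: the limit from zero is the
# minimum-two-norm solution (2/3, 4/3, 2/3).
x_consistent = landweber([[1, 1, 0], [0, 1, 1]], [2, 2], [0.0, 0.0, 0.0])

# Inconsistent system x = 0, x = 2: the limit is the least-squares value 1.
x_inconsistent = landweber([[1], [1]], [0, 2], [0.0])
print(x_consistent, x_inconsistent)
```

Unlike the ART, every equation contributes to each update, which is why the iteration still settles down when the equations contradict one another.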


14.6.2 The SMART 


The simultaneous MART (SMART) is the simultaneous version of the 
MART. The iterative step of the SMART is the following: 


$$x_j^{k+1} = x_j^k \exp\Big( s_j^{-1} \sum_{i=1}^{I} P_{ij} \log \frac{y_i}{(Px^k)_i} \Big). \qquad (14.5)$$

When the system $y = Px$ has nonnegative solutions, the sequence generated
by Equation (14.5) converges to the nonnegative solution that minimizes
$KL(x, x^0)$, just as the MART does. In addition, when the system $y = Px$
is inconsistent, that is, has no nonnegative solution, the SMART sequence
converges to the nonnegative minimizer of $KL(Px, y)$ for which $KL(x, x^0)$
is minimized.
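
The SMART step in Equation (14.5) can be sketched as follows. The function names, the matrix with unit column sums, and the data vector are illustrative assumptions. The run records $KL(Px^k, y)$ and the sum of the entries of $x^k$ so that two properties used later in the convergence argument can be observed: the KL values do not increase, and the entry sums of the iterates stay bounded by the sum of the $y_i$.

```python
import math

def kl_vec(x, z):
    """KL distance between positive vectors."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(x, z))

def apply(P, x):
    return [sum(pij * xj for pij, xj in zip(row, x)) for row in P]

def smart_step(P, y, x):
    """One SMART update, Equation (14.5), using all equations at once."""
    I, J = len(P), len(P[0])
    Px = apply(P, x)
    s = [sum(P[i][j] for i in range(I)) for j in range(J)]
    return [
        x[j] * math.exp(sum(P[i][j] * math.log(y[i] / Px[i]) for i in range(I)) / s[j])
        for j in range(J)
    ]

P = [[0.5, 0.2, 0.1], [0.5, 0.8, 0.9]]  # each column sums to one
y = [0.3, 0.7]
x = [1.0, 1.0, 1.0]
kl_history, mass_history = [], []
for _ in range(50):
    kl_history.append(kl_vec(apply(P, x), y))
    x = smart_step(P, y, x)
    mass_history.append(sum(x))
print(kl_history[0], kl_history[-1], mass_history[-1])
```

The logarithms require positive $(Px^k)_i$ and $y_i$, which holds here because all entries involved are positive.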


14.6.3 The EMML Algorithm 


Closely related to the SMART is the EMML algorithm, which is a spe- 
cial case of a more general method known in statistics as the EM algorithm. 
The iterative step of the EMML algorithm is the following: 


$$x_j^{k+1} = x_j^k s_j^{-1} \sum_{i=1}^{I} P_{ij} \Big( \frac{y_i}{(Px^k)_i} \Big). \qquad (14.6)$$

When the system $y = Px$ has nonnegative solutions, the sequence
generated by Equation (14.6) converges to a nonnegative solution. When the
system $y = Px$ has no nonnegative solutions, the sequence converges to a
nonnegative minimizer of $KL(y, Px)$. In neither case can we say precisely
how the limit of the sequence depends on the starting vector, $x^0$.

In the urn model for remote sensing, as discussed in Chapter 1, the
entries of the vector $y$ are $y_i$, the proportion of the trials in which the $i$th
color was drawn, so that $y_+ = 1$. In addition, we have $s_j = 1$. We want the
solution $x$ to be a probability vector, as well. When $s_j = 1$ for each $j$ and
there are exact nonnegative solutions of $y = Px$, the solutions provided by
the SMART and the EMML method, although they may be different, will
both be probability vectors. However, when $y = Px$ has no nonnegative
solution, the limit of the SMART sequence will have $x_+ < 1$, while the
limit of the EMML sequence will still be a probability vector. For details
and references concerning the SMART and the EMML algorithm, consult
[41].
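
A short sketch of the EMML step in Equation (14.6) illustrates the last point. The function names and data are illustrative assumptions; the vector $y$ is chosen so that $y = Px$ has no nonnegative solution for this $P$. Summing Equation (14.6) over $j$ with weights $s_j$ shows that $\sum_j s_j x_j^{k+1} = y_+$ after every step, so with $s_j = 1$ and $y_+ = 1$ each iterate is a probability vector. The run also records $KL(y, Px^k)$, which should not increase.

```python
import math

def kl_vec(x, z):
    """KL distance between positive vectors."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(x, z))

def apply(P, x):
    return [sum(pij * xj for pij, xj in zip(row, x)) for row in P]

def emml_step(P, y, x):
    """One EMML update, Equation (14.6)."""
    I, J = len(P), len(P[0])
    Px = apply(P, x)
    s = [sum(P[i][j] for i in range(I)) for j in range(J)]
    return [x[j] / s[j] * sum(P[i][j] * y[i] / Px[i] for i in range(I))
            for j in range(J)]

P = [[0.5, 0.2, 0.1], [0.5, 0.8, 0.9]]  # each column sums to one
y = [0.8, 0.2]                          # not attainable: (Px)_1 <= (Px)_2 always
x = [1.0, 1.0, 1.0]
kl_history, mass_history = [], []
for _ in range(50):
    kl_history.append(kl_vec(y, apply(P, x)))
    x = emml_step(P, y, x)
    mass_history.append(sum(x))
print(kl_history[0], kl_history[-1], mass_history[-1])
```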


14.6.4 Block-Iterative Versions 


Simultaneous iterative algorithms tend to converge slowly. Sequential 
versions of these algorithms, which typically converge more rapidly, may 
make inefficient use of the machine architecture. Block-iterative versions 
of these algorithms permit the use of some, but perhaps not all, of the 
equations at each step. A block-iterative version of the EMML algorithm 
was used to obtain sub-pixel resolution from SAR image data [117]. 


14.6.5 Convergence of the SMART 


We turn now to a proof of convergence of the SMART. Convergence of 
the MART and the block-iterative versions of SMART are proved similarly 
and we omit these proofs. 

We shall assume, for notational convenience, that $s_j = 1$ for all $j$. As we
have seen, this is sometimes the case in applications, and if it is not true,
we can redefine both $P$ and $x$ to make it happen. Using Equation (14.2),
we have

$$KL(x, z) \ge KL(Px, Pz),$$


for all nonnegative $x$ and $z$.
We use the alternating minimization formalism of [61]. For each non-
negative $x$ for which

$$(Px)_i = \sum_{j=1}^{J} P_{ij} x_j$$

is positive, for each $i$, we let $r(x)$ and $q(x)$ be the $I$ by $J$ matrices with
entries

$$r(x)_{ij} = x_j P_{ij} \frac{y_i}{(Px)_i}$$

and

$$q(x)_{ij} = x_j P_{ij}.$$


The alternating minimization then involves the function

$$KL(q(x), r(z)) = \sum_{i=1}^{I} \sum_{j=1}^{J} KL(q(x)_{ij}, r(z)_{ij}).$$

The following Pythagorean identities are central to the proof:

$$KL(q(x), r(z)) = KL(q(x), r(x)) + KL(x, z) - KL(Px, Pz),$$

and

$$KL(q(x), r(z)) = KL(q(z'), r(z)) + KL(x, z'),$$

where

$$z'_j = z_j \exp\left( \sum_{i=1}^{I} P_{ij} \log \frac{y_i}{(Pz)_i} \right).$$


It follows, then, that, having calculated $x^k$, we get $x^{k+1}$ by minimizing
$KL(q(x), r(x^k))$ over all nonnegative $x$. The remainder of the convergence
proof is contained in the following sequence of exercises.
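The SMART step described above, $x^{k+1} = (x^k)'$ under the assumption $s_j = 1$, can be sketched in Python as follows. The matrix `P` (whose columns sum to one), the data `y` and the starting vector are hypothetical. The code tracks $KL(Px^k, y)$ and the sum of the entries of each iterate, so that the monotonicity and boundedness claimed in the exercises below can be checked numerically on this one example. This is a check, not a proof.

```python
import math

def kl(a, b):
    """Kullback-Leibler distance between positive vectors a and b."""
    return sum(ai * math.log(ai / bi) + bi - ai for ai, bi in zip(a, b))

def matvec(P, x):
    """Matrix-vector product Px, with P stored as a list of rows."""
    return [sum(pij * xj for pij, xj in zip(row, x)) for row in P]

def smart_step(P, y, x):
    """x_j <- x_j * exp( sum_i P_ij * log(y_i / (Px)_i) ), assuming s_j = 1."""
    Px = matvec(P, x)
    return [
        x[j] * math.exp(sum(row[j] * math.log(y_i / Px_i)
                            for row, y_i, Px_i in zip(P, y, Px)))
        for j in range(len(x))
    ]

# Hypothetical toy data; each column of P sums to one, so s_j = 1.
P = [[0.6, 0.2], [0.3, 0.3], [0.1, 0.5]]
y = matvec(P, [2.0, 1.0])  # consistent data, y = P x with x = (2, 1)

x = [1.0, 1.0]  # positive starting vector x^0 (an arbitrary choice)
kl_history = [kl(matvec(P, x), y)]
sum_history = []
for _ in range(200):
    x = smart_step(P, y, x)
    kl_history.append(kl(matvec(P, x), y))
    sum_history.append(sum(x))
```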


Ex. 14.1 Use the Pythagorean identities and the fact that

$$KL(q(x), r(x)) = KL(Px, y)$$

to show that the sequence $\{KL(Px^k, y)\}$ is decreasing, and the sequence
$\{KL(x^k, x^{k+1})\}$ converges to zero.


Ex. 14.2 Show that

$$\sum_{j=1}^{J} x_j^{k+1} \le \sum_{i=1}^{I} y_i,$$

so that the sequence $\{x^k\}$ is a bounded sequence.


Ex. 14.3 Let $x^*$ be any cluster point of the sequence $\{x^k\}$. Show that
$(x^*)' = x^*$.


Ex. 14.4 Let $\hat{x}$ be a nonnegative minimizer of $KL(Px, y)$. Show that

$$(\hat{x})' = \hat{x}.$$

Note that, since $KL(x, z)$ is strictly convex in each variable, the vector
$P\hat{x}$ is unique, even if $\hat{x}$ is not unique.


Ex. 14.5 Show that

$$KL(\hat{x}, x^k) - KL(\hat{x}, x^{k+1}) = KL(Px^{k+1}, y) - KL(P\hat{x}, y)
  + KL(P\hat{x}, Px^k) + KL(x^{k+1}, x^k) - KL(Px^{k+1}, Px^k), \tag{14.7}$$

so that $KL(P\hat{x}, Px^*) = 0$, the sequence $\{KL(\hat{x}, x^k)\}$ is decreasing, and
$KL(\hat{x}, x^*)$ is finite.


Ex. 14.6 Show that, for any cluster point $x^*$, $KL(Px^*, y) = KL(P\hat{x}, y)$.


We know now that $x^*$ is a nonnegative minimizer of $KL(Px, y)$. Replacing
$\hat{x}$ with $x^*$, we find that the sequence $\{KL(x^*, x^k)\}$ converges to zero. Since
the right side of Equation (14.7) depends only on $P\hat{x}$ and not directly on
$\hat{x}$, so does the left side. Consequently, summing the left side over $k$, we find
that

$$KL(\hat{x}, x^0) - KL(\hat{x}, x^*)$$

is independent of the choice of $\hat{x}$. It follows that $x^*$ is the nonnegative
minimizer of $KL(Px, y)$ for which $KL(x, x^0)$ is minimized.


15.1 Chapter Summary


Analysis and synthesis in signal processing refers to the effort to study 
complicated functions in terms of simpler ones. The basic building blocks 
are orthogonal bases and frames. 

We begin with signal-processing problems arising in radar. Not only 
does radar provide an important illustration of the application of the theory 
of Fourier transforms and matched filters, but it also serves to motivate 
several of the mathematical concepts we shall encounter in our discussion 
of wavelets. The connection between radar signal processing and wavelets 
is discussed in some detail in Kaiser’s book [97]. 

There are applications in which the frequency composition of the signal
of interest will change over time. A good analogy is a piece of music, where
notes at certain frequencies are heard for a while and then are replaced by
notes at other frequencies. We do not usually care what the overall
contribution of, say, middle C is to the song, but do want to know which notes are
to be sounded when and for how long. Analyzing such non-stationary
signals requires tools other than the Fourier transform: the short-time Fourier
transform is one such tool; wavelet expansion is another.


15.2 The Basic Idea 


An important theme that runs through most of mathematics, from 
the geometry of the early Greeks to modern signal processing, is analy- 
sis and synthesis, or, less formally, breaking up and putting back together. 
The Greeks estimated the area of a circle by breaking it up into sectors 
that approximated triangles. The Riemann approach to integration involves 
breaking up the area under a curve into pieces that approximate rectangles 
or other simple shapes. Viewed differently, the Riemann approach is first 
to approximate the function to be integrated by a step function and then 
to integrate the step function. 

Along with geometry, Euclid includes a good deal of number theory, 
where, again, we find analysis and synthesis. His theorem that every integer
greater than one is divisible by a prime is analysis; division does the breaking up
and the simple pieces are the primes. The fundamental theorem of arith- 
metic, which asserts that every positive integer can be written in a unique 
way as the product of powers of primes, is synthesis, with the putting back 
together done by multiplication. 


15.3 Polynomial Approximation 


The individual power functions, $x^n$, are not particularly interesting by
themselves, but when finitely many of them are scaled and added to form a 
polynomial, interesting functions can result, as the famous approximation 
theorem of Weierstrass confirms [101]: 


Theorem 15.1 If $f : [a, b] \to \mathbb{R}$ is continuous and $\epsilon > 0$ is given, we can
find a polynomial $P$ such that $|f(x) - P(x)| < \epsilon$ for every $x$ in $[a, b]$.


The idea of building complicated functions from powers is carried a 
step further with the use of infinite series, such as Taylor series. The sine 
function, for example, can be represented for all real x by the infinite power 
series 

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots.$$

The most interesting thing to note about this is that the sine function has
properties that none of the individual power functions possess; for example,
it is bounded and periodic. So we see that an infinite sum of simple functions
can be qualitatively different from the components in the sum. If we take
the sum of only finitely many terms in the Taylor series for the sine function
we get a polynomial, which cannot provide a good approximation of the
sine function for all $x$; that is, the finite sum does not approximate the sine
function uniformly over the real line. The approximation is better for $x$
near zero and poorer as we move away from zero. However, for any selected
$x$ and for any $\epsilon > 0$, there is a positive integer $N$, depending on the $x$ and
on the $\epsilon$, with the sum of the first $n$ terms of the series within $\epsilon$ of $\sin x$
for $n > N$; that is, the series converges pointwise to $\sin x$ for each real $x$.
In Fourier analysis the trigonometric functions themselves are viewed as 
the simple functions, and we try to build more complicated functions as 
(possibly infinite) sums of trig functions. In wavelet analysis we have more 
freedom to design the simple functions to fit the problem at hand. 
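To make the distinction between pointwise and uniform convergence concrete, here is a small Python sketch that evaluates partial sums of the Taylor series for the sine function. The function name `sine_partial_sum` and the sample points are chosen only for illustration.

```python
import math

def sine_partial_sum(x, n_terms):
    """Sum of the first n_terms nonzero terms of the Taylor series of sin at 0."""
    total = 0.0
    term = x  # first term: x^1 / 1!
    for m in range(n_terms):
        total += term
        # Next term: multiply by -x^2 / ((2m+2)(2m+3)).
        term *= -x * x / ((2 * m + 2) * (2 * m + 3))
    return total

# Four terms are accurate near zero but useless at x = 10.
error_near_zero = abs(sine_partial_sum(0.5, 4) - math.sin(0.5))
error_far_away = abs(sine_partial_sum(10.0, 4) - math.sin(10.0))

# For the fixed point x = 10, taking more terms repairs the approximation.
error_far_many_terms = abs(sine_partial_sum(10.0, 30) - math.sin(10.0))
```

The same four-term polynomial that is accurate at $x = 0.5$ is off by more than a thousand at $x = 10$, while thirty terms suffice there. The number of terms needed depends on $x$, which is the sense in which the convergence is pointwise but not uniform.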


15.4 Signal Analysis 


When we speak of signal analysis, we often mean that we believe the 
signal to be a superposition of simpler signals of a known type and we wish 
to know which of these simpler signals are involved and to what extent. For 
example, received sonar or radar data may be the superposition of individ- 
ual components corresponding to spatially localized targets of interest. As 
we shall see in our discussion of the ambiguity function and of wavelets, 
we want to tailor the family of simpler signals to fit the physical problem 
being considered. 

Sometimes it is not the individual components that are significant by 
themselves, but groupings of these components. For example, if our received 
signal is believed to consist of a lower frequency signal of interest plus a 
noise component employing both low and high frequencies, we can remove 
some of the noise by performing a low-pass filtering. This amounts to an- 
alyzing the received signal to determine what its low-pass and high-pass 
components are. We formulate this operation mathematically using the 
Fourier transform, which decomposes the received signal f(t) into complex 
exponential function components corresponding to different frequencies. 
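A discrete analogue of this low-pass operation can be sketched in Python with a direct (slow) discrete Fourier transform. The signal, the noise, the cutoff and the function names below are all hypothetical. The forward transform uses the common $e^{-2\pi i kn/N}$ sign convention, which need not match the convention used for the Fourier transform elsewhere in the text. For simplicity the noise here is purely high-frequency, so the filter removes all of it; noise with low-frequency content would only be partly removed.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform with forward kernel exp(-2 pi i k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def low_pass(x, cutoff):
    """Zero every DFT bin whose frequency index exceeds cutoff in absolute value."""
    N = len(x)
    X = dft(x)
    kept = [X[k] if min(k, N - k) <= cutoff else 0j for k in range(N)]
    return [value.real for value in idft(kept)]

N = 64
signal_of_interest = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
high_freq_noise = [0.5 * math.cos(2 * math.pi * 20 * n / N) for n in range(N)]
received = [s + w for s, w in zip(signal_of_interest, high_freq_noise)]

filtered = low_pass(received, cutoff=5)
max_error = max(abs(a - b) for a, b in zip(filtered, signal_of_interest))
```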

More generally, we may analyze a signal $f(t)$ by calculating certain inner
products $\langle f, g_n \rangle$, $n = 1, \ldots, N$. We may wish to encode the signal using these
$N$ numbers, or to make a decision about the signal, such as recognizing
a voice. If the signal is a two-dimensional image, say a fingerprint, we
may want to construct a database of these $N$-dimensional vectors, for
identification. In such a case we are not necessarily claiming that the signal
$f(t)$ is a superposition of the $g_n(t)$ in any sense, nor do we necessarily expect
to reconstruct $f(t)$ at some later date from the stored inner products. For
example, one might identify a piece of music using only the upward or
downward progression of the first few notes.

There are many cases, on the other hand, in which we do wish to recon- 
struct the signal f(t) from measurements or stored compressed versions. 
In such cases we need to consider this when we design the measuring or 
compression procedures. For example, we may have values of the signal or 
its Fourier transform at some finite number of points and want to recapture 
f(t) itself. Even in those cases mentioned previously in which reconstruc- 
tion is not desired, such as the fingerprint case, we do wish to be reasonably 
sure that similar vectors of inner products correspond to similar signals and 
distinct vectors of inner products correspond to distinct signals, within the 
obvious limitations imposed by the finiteness of the stored inner products. 
The twin processes of analysis and synthesis are dealt with mathematically 
using the notions of frames and bases. 


15.5 Practical Considerations in Signal Analysis 


Perhaps the most basic problem in signal analysis is determining which
sinusoidal components make up a given signal. Let the analog signal $f(t)$
be given for all real $t$ by

$$f(t) = \sum_{j=1}^{J} A_j e^{i\omega_j t},$$

where the $A_j$ are complex amplitudes and the $\omega_j$ are real numbers. If we
view the variable $t$ as time, then the $\omega_j$ are frequencies. In theory, we can
determine $J$, the $\omega_j$, and the $A_j$ simply by calculating the Fourier transform
$F(\omega)$ of $f(t)$. The function $F(\omega)$ will have Dirac delta components at $\omega = \omega_j$
for each $j$, and will be zero elsewhere. Obviously, this is not a practical
solution to the problem. The first step in developing a practical approach
is to pass from analog signals, which are functions of the continuous variable
$t$, to digital signals or sequences, which are functions of the integers.

In theoretical discussions of digital signal processing, analog signals are
converted to discrete signals or sequences by sampling. We begin by choosing
a positive sampling spacing $\Delta > 0$ and define the $n$th entry of the
sequence $x = \{x(n)\}$ by

$$x(n) = f(n\Delta),$$

for all integers $n$.


15.5.1 The Discrete Model 


Notice that, since

$$e^{i\omega_j n\Delta} = e^{i(\omega_j + \frac{2\pi}{\Delta})n\Delta}$$

for all $n$, we cannot distinguish frequency $\omega_j$ from $\omega_j + \frac{2\pi}{\Delta}$. We try to select
$\Delta$ small enough so that each of the $\omega_j$ we seek lies in the interval $(-\frac{\pi}{\Delta}, \frac{\pi}{\Delta})$.
If we fail to make $\Delta$ small enough we under-sample, with the result that
some of the $\omega_j$ will be mistaken for lower frequencies; this is aliasing. Our
goal now is to process the sequence $x$ to determine $J$, the $\omega_j$, and the $A_j$.
We do this with matched filtering.
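The aliasing phenomenon is easy to check numerically. In the Python sketch below, the spacing `delta` and the frequency `omega` are arbitrary illustrative values. The samples of the complex exponential at frequency $\omega$ and at frequency $\omega + 2\pi/\Delta$ agree to within rounding error.

```python
import cmath
import math

delta = 0.1                            # hypothetical sampling spacing
omega = 5.0                            # lies inside (-pi/delta, pi/delta)
alias = omega + 2 * math.pi / delta    # lies outside that interval

def sample(freq, n):
    """The n-th sample of exp(i * freq * t), taken at t = n * delta."""
    return cmath.exp(1j * freq * n * delta)

# Largest difference between the two sampled sequences over n = -50, ..., 50.
max_gap = max(abs(sample(omega, n) - sample(alias, n)) for n in range(-50, 51))
```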

Every linear shift-invariant system operates through convolution;
associated with the system is a sequence $h$, such that, when $x$ is the input
sequence, the output sequence is $y$, with

$$y(n) = \sum_{k=-\infty}^{+\infty} h(k) x(n - k),$$

for each integer $n$. In theoretical matched filtering we design a whole family
of such systems or filters, one for each frequency $\omega$ in the interval $(-\frac{\pi}{\Delta}, \frac{\pi}{\Delta})$.
We then use our sequence $x$ as input to each of these filters and use the
outputs of each to solve our signal-analysis problem.

For each $\omega$ in the interval $(-\frac{\pi}{\Delta}, \frac{\pi}{\Delta})$ and each positive integer $K$, we
consider the shift-invariant linear filter with $h = e_{K,\omega}$, where

$$e_{K,\omega}(k) = \frac{1}{2K+1} e^{i\omega k\Delta},$$

for $|k| \le K$ and $e_{K,\omega}(k) = 0$ otherwise. Using $x$ as input to this system, we
find that the output value $y(0)$ is

$$y(0) = \sum_{j=1}^{J} A_j \left( \frac{1}{2K+1} \sum_{k=-K}^{K} e^{i(\omega - \omega_j)k\Delta} \right). \tag{15.1}$$

Recall the following identity for the Dirichlet kernel:

$$\sum_{k=-K}^{K} e^{ik\omega} = \frac{\sin((K + \frac{1}{2})\omega)}{\sin(\frac{\omega}{2})},$$

for $\sin(\frac{\omega}{2}) \neq 0$. As $K \to +\infty$, the inner sum in Equation (15.1), together
with its factor $\frac{1}{2K+1}$, goes to
zero for every $\omega$ except $\omega = \omega_j$. Therefore the limit, as $K \to +\infty$, of $y(0)$ is
zero, if $\omega$ is not equal to any of the $\omega_j$, and equals $A_j$, if $\omega = \omega_j$. Therefore,
in theory, at least, we can successfully decompose the digital signal into its
constituent parts and distinguish one frequency component from another,
no matter how close together the two frequencies may be.

It is important to note that, to achieve the perfect analysis described
above, we require noise-free values $x(n)$ and we need to take $K$ to infinity;
in practice, of course, neither of these conditions is realistic. We consider
next the practical matter of having only finitely many values of $x(n)$; we
leave the noisy case for another chapter.
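A brief Python sketch of the matched-filter output $y(0)$ may help here. It assumes the filter $h(k) = e^{i\omega k\Delta}/(2K+1)$ for $|k| \le K$ and uses two made-up components; the amplitudes, frequencies and function names are illustrative only. With a large but finite $K$, the output is close to $A_j$ when $\omega = \omega_j$ and close to zero otherwise, but it is exactly neither.

```python
import cmath

def make_samples(amplitudes, freqs, delta):
    """Return the function n -> x(n) = sum_j A_j exp(i w_j n delta)."""
    def x(n):
        return sum(A * cmath.exp(1j * w * n * delta)
                   for A, w in zip(amplitudes, freqs))
    return x

def matched_filter_output(x, omega, K, delta):
    """y(0) = sum_k h(k) x(-k), with h(k) = exp(i omega k delta) / (2K + 1)."""
    total = sum(cmath.exp(1j * omega * k * delta) * x(-k)
                for k in range(-K, K + 1))
    return total / (2 * K + 1)

# Hypothetical two-component signal with spacing delta = 1.
amps = [2.0, 1.0 - 1.0j]
freqs = [0.5, 1.3]
x = make_samples(amps, freqs, 1.0)

on_target = matched_filter_output(x, 0.5, 2000, 1.0)   # close to A_1 = 2
off_target = matched_filter_output(x, 0.9, 2000, 1.0)  # close to 0
small_K = matched_filter_output(x, 0.9, 5, 1.0)        # visible leakage
```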


15.5.2 The Finite-Data Problem 


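In reality we have only finitely many values of $x(n)$, say for $n =
-N, \ldots, N$. In matched filtering we can only take $K \le N$. For the choice of
$K = N$, we get

$$y(0) = \sum_{j=1}^{J} A_j \left( \frac{1}{2N+1} \sum_{k=-N}^{N} e^{i(\omega - \omega_j)k\Delta} \right),$$

for each fixed $\omega$ different from the $\omega_j$, and $y(0) = A_j$ for $\omega = \omega_j$. We can
then write

$$y(0) = \sum_{j=1}^{J} A_j \frac{1}{2N+1} \frac{\sin((\omega - \omega_j)(N + \frac{1}{2})\Delta)}{\sin((\omega - \omega_j)\frac{\Delta}{2})},$$

for $\omega$ not equal to $\omega_j$. The problem we face for finite data is that the $y(0)$
is not necessarily zero when $\omega$ is not one of the $\omega_j$.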

In our earlier discussion of signal analysis it was shown that, if we are
willing to make a simplifying assumption, we can continue as in the infinite-
data case. The simplifying assumption is that the $\omega_j$ we seek are $J$ of the
$2N + 1$ frequencies equally spaced in the interval $(-\frac{\pi}{\Delta}, \frac{\pi}{\Delta}]$, beginning with
$\alpha_1 = -\frac{\pi}{\Delta} + \frac{2\pi}{(2N+1)\Delta}$ and ending with $\alpha_{2N+1} = \frac{\pi}{\Delta}$. Therefore,

$$\alpha_m = -\frac{\pi}{\Delta} + \frac{2\pi m}{(2N+1)\Delta},$$

for $m = 1, \ldots, 2N + 1$.
Having made this simplifying assumption, we then design the matched
filters corresponding to the frequencies $\alpha_n$, for $n = 1, \ldots, 2N + 1$. Because

$$\sum_{k=-N}^{N} e^{i(\alpha_m - \alpha_n)k\Delta} = \sum_{k=-N}^{N} e^{2\pi i \frac{m-n}{2N+1} k}
  = \frac{\sin(2\pi \frac{m-n}{2N+1}(N + \frac{1}{2}))}{\sin(\pi \frac{m-n}{2N+1})},$$

it follows that

$$\sum_{k=-N}^{N} e^{i(\alpha_m - \alpha_n)k\Delta} = 0,$$

for $m \neq n$ and it is equal to $2N + 1$ when $m = n$. We conclude that,
provided the frequencies we seek are among the $\alpha_m$, we can determine $J$
and the $\omega_j$. Once we have these pieces of information, we find the $A_j$ simply
by solving a system of linear equations.
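The orthogonality relation above can be verified numerically. The Python sketch below uses hypothetical values of $N$ and $\Delta$ and two made-up on-grid components. It checks that the sums vanish for $m \neq n$, and that the matched-filter outputs at the grid frequencies pick out exactly the components that are present. In this on-grid situation the orthogonality makes the linear system for the amplitudes diagonal, so the outputs themselves are the amplitudes.

```python
import cmath
import math

N, delta = 8, 0.5                  # hypothetical: 2N + 1 = 17 samples
M = 2 * N + 1
# Grid frequencies alpha_m, m = 1, ..., 2N + 1.
alpha = {m: -math.pi / delta + 2 * math.pi * m / (M * delta)
         for m in range(1, M + 1)}

def gram(m, n):
    """sum_{k=-N}^{N} exp(i (alpha_m - alpha_n) k delta)."""
    return sum(cmath.exp(1j * (alpha[m] - alpha[n]) * k * delta)
               for k in range(-N, N + 1))

# Made-up signal whose two frequencies lie on the grid (indices 3 and 12).
true_amps = {3: 1.5, 12: -0.5 + 2.0j}

def x(n):
    """Sample x(n) = sum_j A_j exp(i omega_j n delta)."""
    return sum(A * cmath.exp(1j * alpha[m] * n * delta)
               for m, A in true_amps.items())

def matched_output(m):
    """Matched-filter output y(0) for the grid frequency alpha_m, with K = N."""
    return sum(cmath.exp(1j * alpha[m] * k * delta) * x(-k)
               for k in range(-N, N + 1)) / M

outputs = {m: matched_output(m) for m in range(1, M + 1)}
detected = sorted(m for m, value in outputs.items() if abs(value) > 1e-9)
```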


15.6 Frames 


Although in practice we deal with finitely many measurements or inner
product values, it is convenient, in theoretical discussions, to imagine
that the signal $f(t)$ has been associated with an infinite sequence of inner
products $\{\langle f, g_n \rangle, n = 1, 2, \ldots\}$. It is also convenient to assume that
$\|f\|^2 = \int_{-\infty}^{\infty} |f(t)|^2 dt < +\infty$; that is, we assume that $f$ is in the Hilbert
space $H = L^2$. The sequence $\{g_n | n = 1, 2, \ldots\}$ in any Hilbert space $H$ is
called a frame for $H$ if there are positive constants $A \le B$ such that, for
all $f$ in $H$,

$$A\|f\|^2 \le \sum_{n=1}^{\infty} |\langle f, g_n \rangle|^2 \le B\|f\|^2. \tag{15.2}$$

The inequalities in (15.2) define the frame property. A frame is said to be
tight if $A = B$.

To motivate this definition, suppose that $f = g - h$. If $g$ and $h$ are
nearly equal, then $f$ is near zero, so that $\|f\|^2$ is near zero. Consequently,
the numbers $|\langle f, g_n \rangle|^2$ are all small, meaning that $\langle g, g_n \rangle$ is nearly equal to
$\langle h, g_n \rangle$ for each $n$. Conversely, if $\langle g, g_n \rangle$ is nearly equal to $\langle h, g_n \rangle$ for each
$n$, then the numbers $|\langle f, g_n \rangle|^2$ are all small. Therefore, $\|f\|^2$ is small, from
which we conclude that $g$ is close to $h$. The analysis operator is the one
that takes us from $f$ to the sequence $\{\langle f, g_n \rangle\}$, while the synthesis operator
takes us from the sequence $\{\langle f, g_n \rangle\}$ to $f$. This discussion of frames and
related notions is based on the treatment in Christensen's book [54].

In the case of a finite-dimensional Hilbert space $H$, any finite set
$\{g_n, n = 1, \ldots, N\}$ is a frame for the space $H$ of all $f$ that are linear
combinations of the $g_n$.


Ex. 15.1 An interesting example of a frame in $H = \mathbb{R}^2$ is the so-
called Mercedes frame: let $g_1 = (0, 1)$, $g_2 = (-\sqrt{3}/2, -1/2)$ and $g_3 =
(\sqrt{3}/2, -1/2)$. Show that for this frame $A = B = 3/2$, so the Mercedes
frame is tight.


Ex. 15.2 Let $W = U \cup V$ be the union of two orthonormal bases for $\mathbb{C}^N$,
$U$ and $V$. Show that $W$ is a tight frame with $A = B = 2$.

For example, consider $U = \{u^1, u^2, \ldots, u^N\}$ the usual orthonormal basis
for $\mathbb{C}^N$, where all the entries of $u^n$ are zero, except that $u^n_n = 1$, and
$V = \{v^1, v^2, \ldots, v^N\}$ the Fourier basis, with the $m$th entry of $v^n$ given by

$$v^n_m = \frac{1}{\sqrt{N}} e^{2\pi i mn/N}.$$

This particular frame is used often in compressed sensing and compressed
sampling, as discussed in Chapter 22.

The JPEG method for compressing images uses a similar frame that is 
the union of a discrete cosine basis and a discrete wavelet basis. The idea is 
that most images that we wish to compress can be represented as a linear 
combination of relatively few discrete cosine vectors and wavelet vectors. 

The frame property in (15.2) provides a necessary condition for stable 
application of the decomposition and reconstruction operators. But it does 
more than that; it actually provides a reconstruction algorithm. The frame 
operator S is given by 


$$Sf = \sum_{n=1}^{\infty} \langle f, g_n \rangle g_n.$$

The frame property implies that the frame operator is invertible. The dual
frame is the sequence $\{S^{-1} g_n, n = 1, 2, \ldots\}$.


Ex. 15.3 Use the definitions of the frame operator $S$ and the dual frame
to obtain the following reconstruction formulas:

$$f = \sum_{n=1}^{\infty} \langle f, g_n \rangle S^{-1} g_n;$$

and

$$f = \sum_{n=1}^{\infty} \langle f, S^{-1} g_n \rangle g_n.$$

If the frame is tight, then the dual frame is $\{\frac{1}{A} g_n, n = 1, 2, \ldots\}$; if the frame
is not tight, inversion of the frame operator is done only approximately.
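For a tight frame the reconstruction is particularly simple, and a small numerical check may be helpful. The Python sketch below uses the three Mercedes-frame vectors $(0, 1)$, $(-\sqrt{3}/2, -1/2)$ and $(\sqrt{3}/2, -1/2)$ in $\mathbb{R}^2$ together with an arbitrary test vector `f`. The function names are illustrative, and the check covers only this one vector, so it is not a substitute for the exercises.

```python
import math

# The three vectors of the Mercedes frame in the plane.
frame = [(0.0, 1.0),
         (-math.sqrt(3) / 2, -0.5),
         (math.sqrt(3) / 2, -0.5)]

def inner(u, v):
    """Real inner product in the plane."""
    return u[0] * v[0] + u[1] * v[1]

def analysis(f, vectors):
    """Analysis operator: f -> the list of inner products <f, g_n>."""
    return [inner(f, g) for g in vectors]

def synthesis(coeffs, vectors):
    """Synthesis: the linear combination sum_n coeffs[n] * vectors[n]."""
    return (sum(c * g[0] for c, g in zip(coeffs, vectors)),
            sum(c * g[1] for c, g in zip(coeffs, vectors)))

f = (0.3, -1.2)                       # arbitrary test vector
coeffs = analysis(f, frame)

# Ratio sum |<f, g_n>|^2 / ||f||^2; for a tight frame this is the constant A.
energy_ratio = sum(c * c for c in coeffs) / inner(f, f)

# Frame operator applied to f, and reconstruction using the dual frame g_n / A.
Sf = synthesis(coeffs, frame)
dual = [(g[0] / 1.5, g[1] / 1.5) for g in frame]
f_rec = synthesis(coeffs, dual)
```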


15.7 Bases, Riesz Bases, and Orthonormal Bases 


A set of vectors $\{g_n, n = 1, 2, \ldots\}$ in $H$ is a basis for $H$ if, for every $f$ in
$H$, there are unique constants $\{c_n, n = 1, 2, \ldots\}$ with

$$f = \sum_{n=1}^{\infty} c_n g_n.$$

A basis is called a Riesz basis if it is also a frame for $H$. It can be shown that
a frame is a Riesz basis if the removal of any one element causes the loss of
the frame property; since the second inequality in Inequality (15.2) is not
lost, it follows that it is the first inequality that can now be violated for
some $f$. A basis $\{g_n | n = 1, 2, \ldots\}$ is an orthonormal basis for $H$ if $\|g_n\| = 1$
for all $n$ and $\langle g_n, g_m \rangle = 0$ for distinct $m$ and $n$.

We know that the complex exponentials

$$\left\{ e_n(t) = \frac{1}{\sqrt{2\pi}} e^{int}, \; -\infty < n < \infty \right\}$$

form an orthonormal basis for the Hilbert space $L^2(-\pi, \pi)$ consisting of
all $f$ supported on $(-\pi, \pi)$ with $\int_{-\pi}^{\pi} |f(t)|^2 dt < +\infty$. Every such $f$ can be
written as

$$f(t) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{+\infty} a_n e^{int},$$

for

$$a_n = \langle f, e_n \rangle = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} f(t) e^{-int} dt.$$

Consequently, this is true for every $f$ in $L^2(-\pi/2, \pi/2)$, although the set of
functions $\{g_n\}$ formed by restricting the $\{e_n\}$ to the interval $(-\pi/2, \pi/2)$ is
no longer a basis for $H = L^2(-\pi/2, \pi/2)$. It is still a tight frame with $A = 1$,
but is no longer normalized, since the norm of $g_n$ in $L^2(-\pi/2, \pi/2)$ is $1/\sqrt{2}$.
An orthonormal basis can be characterized as any sequence with $\|g_n\| = 1$
for all $n$ that is a tight frame with $A = 1$. The sequence $\{\sqrt{2} g_{2k}, k =
-\infty, \ldots, \infty\}$ is an orthonormal basis for $L^2(-\pi/2, \pi/2)$, as is the sequence
$\{\sqrt{2} g_{2k+1}, k = -\infty, \ldots, \infty\}$. The sequence $\{\langle f, g_n \rangle, n = -\infty, \ldots, \infty\}$ is
redundant; the half corresponding either to the odd $n$ or to the even $n$ suffices
to recover $f$. Because of this redundancy we can tolerate more inaccuracy
in measuring these values; indeed, this is one of the main attractions of
frames in signal processing.


15.8 Radar Problems 


In radar a real-valued function $\psi(t)$ representing a time-varying voltage
is converted by an antenna in transmission mode into a propagating
electromagnetic wave. When this wave encounters a reflecting target an echo
is produced. The antenna, now in receiving mode, picks up the echo $f(t)$,
which is related to the original signal by

$$f(t) = A\psi(t - d(t)),$$

where $d(t)$ is the time required for the original signal to make the round trip
from the antenna to the target and return back at time $t$. The amplitude $A$
incorporates the reflectivity of the target as well as attenuation suffered by
the signal. As we shall see shortly, the delay $d(t)$ depends on the distance
from the antenna to the target and, if the target is moving, on its radial
velocity. The main signal-processing problem here is to determine target
range and radial velocity from knowledge of $f(t)$ and $\psi(t)$.

If the target is stationary, at a distance $r_0$ from the antenna, then
$d(t) = 2r_0/c$, where $c$ is the speed of light. In this case the original signal
and the received echo are related simply by

$$f(t) = A\psi(t - b),$$

for $b = 2r_0/c$. When the target is moving so that its distance to the
antenna, $r(t)$, is time-dependent, the relationship between $f$ and $\psi$ is more
complicated.


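Ex. 15.4 Suppose the target is at a distance $r_0 > 0$ from the antenna at
time $t = 0$, and has radial velocity $v$, with $v > 0$ indicating away from the
antenna. Show that the delay function $d(t)$ is now

$$d(t) = 2 \frac{r_0 + vt}{c + v},$$

and that $f(t)$ is related to $\psi(t)$ according to

$$f(t) = A\psi\left( \frac{t - b}{a} \right), \tag{15.3}$$

for

$$a = \frac{c + v}{c - v}$$

and

$$b = \frac{2r_0}{c - v}.$$

Show also that if we select $A = \left( \frac{c - v}{c + v} \right)^{1/2}$ then energy is preserved; that is,
$\|f\| = \|\psi\|$.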
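The relations in this exercise can be checked numerically. The Python sketch below assumes the delay function $d(t) = 2(r_0 + vt)/(c + v)$, which follows from requiring that a pulse received at time $t$ was reflected at time $t - d(t)/2$, when the target range was $c\,d(t)/2$. The numerical values of `r0` and `v` are made up.

```python
c = 3.0e8      # speed of light in m/s (rounded)
r0 = 1.0e4     # hypothetical range at time t = 0, in m
v = 300.0      # hypothetical radial velocity in m/s, away from the antenna

def delay(t):
    """Round-trip delay d(t) for an echo received at time t."""
    return 2.0 * (r0 + v * t) / (c + v)

a = (c + v) / (c - v)              # rescaling parameter
b = 2.0 * r0 / (c - v)             # shift parameter
A = ((c - v) / (c + v)) ** 0.5     # amplitude that preserves energy

def reflection_mismatch(t):
    """Distance travelled one way minus target range at the reflection time."""
    d = delay(t)
    return c * d / 2.0 - (r0 + v * (t - d / 2.0))

def argument_mismatch(t):
    """Difference between t - d(t) and (t - b) / a; should vanish."""
    return (t - delay(t)) - (t - b) / a

# ||f||^2 = A^2 * a * ||psi||^2, so energy is preserved when A^2 * a = 1.
energy_factor = A * A * a
```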


Ex. 15.5 Let $\Psi(\omega)$ be the Fourier transform of the signal $\psi(t)$. Show that
the Fourier transform of the echo $f(t)$ in Equation (15.3) is then

$$F(\omega) = Aae^{ib\omega}\Psi(a\omega).$$


The basic problem is to determine $a$ and $b$, and therefore the range and
radial velocity of the target, from knowledge of $f(t)$ and $\psi(t)$. An obvious
approach is to do a matched filter.


15.9 The Wideband Cross-Ambiguity Function


Note that the received echo $f(t)$ is related to the original signal by the
operations of rescaling and shifting. We therefore match the received echo
with all the shifted and rescaled versions of the original signal. For each
$a > 0$ and real $b$, let

$$\psi_{a,b}(t) = \psi\left( \frac{t - b}{a} \right).$$

The wideband cross-ambiguity function (WCAF) is

$$(W_\psi f)(b, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t) \psi_{a,b}(t) dt.$$

In the ideal case the values of $a$ and $b$ for which the WCAF takes on its
largest absolute value should be the true values of $a$ and $b$.
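A toy version of this matched-filter search can be sketched in Python. The sketch assumes that the WCAF carries the normalizing factor $1/\sqrt{a}$, uses a made-up real pulse (a "Mexican hat" shape) in place of a real transmitted signal, and searches a coarse hypothetical grid of $(a, b)$ values. In actual radar the rescaling $a$ is extremely close to one, so the values here are exaggerated for illustration.

```python
import math

def psi(t):
    """A hypothetical real transmitted pulse (Mexican-hat shape)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

a_true, b_true = 1.1, 2.0   # made-up rescaling and shift of the echo

def echo(t):
    """Noise-free echo A * psi((t - b)/a) with the energy-preserving A."""
    return psi((t - b_true) / a_true) / math.sqrt(a_true)

DT = 0.02
T_GRID = [-12.0 + DT * k for k in range(1401)]   # t from -12 to 16
ECHO = [echo(t) for t in T_GRID]

def wcaf(b, a):
    """Riemann-sum approximation of (1/sqrt(a)) * integral f(t) psi((t-b)/a) dt."""
    total = sum(f_t * psi((t - b) / a) for f_t, t in zip(ECHO, T_GRID))
    return total * DT / math.sqrt(a)

A_GRID = [0.8, 0.9, 1.0, 1.1, 1.2]
B_GRID = [1.0, 1.5, 2.0, 2.5, 3.0]

# Pick the grid point (a, b) at which |WCAF| is largest.
best = max(((a, b) for a in A_GRID for b in B_GRID),
           key=lambda ab: abs(wcaf(ab[1], ab[0])))
```

By the Cauchy-Schwarz inequality the normalized correlation cannot exceed $\|f\| \, \|\psi\|$, and it reaches that value only when the rescaled, shifted pulse is proportional to the echo. This is why the search returns the true pair in this noise-free example.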

More generally, there will be many individual targets or sources of
echoes, each having its own values of $a$, $b$, and $A$. The resulting received
echo function $f(t)$ is a superposition of the individual functions $\psi_{a,b}(t)$,
which, for technical reasons, we write as

$$f(t) = \int_{-\infty}^{\infty} \int_0^{\infty} D(b, a) \psi_{a,b}(t) \frac{da\, db}{a^2}. \tag{15.4}$$

We then have the inverse problem of determining $D(b, a)$ from $f(t)$.

Equation (15.4) provides a representation of the echo $f(t)$ as a
superposition of rescaled translates of a single function, namely the original
signal $\psi(t)$. We shall encounter this representation again in our discussion of
wavelets, where the signal $\psi(t)$ is called the mother wavelet and the WCAF
is called the integral wavelet transform. One reason for discussing radar and
ambiguity functions now is to motivate some of the wavelet theory. Our
discussion here follows closely the treatment in [97], where Kaiser emphasizes
the important connections between wavelets and radar ambiguity functions.

As we shall see when we study wavelets in Chapter 16, we can recover
the signal $f(t)$ from the WCAF using the following inversion formula: at
points $t$ where $f(t)$ is continuous we have

$$f(t) = \frac{1}{c_\psi} \int_{-\infty}^{\infty} \int_0^{\infty} (W_\psi f)(b, a)\, \psi\left( \frac{t - b}{a} \right) \frac{da\, db}{a^{5/2}},$$

with

$$c_\psi = \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^2}{|\omega|} d\omega$$

for $\Psi(\omega)$ the Fourier transform of $\psi(t)$. The obvious conjecture is then that
the distribution function $D(b, a)$ is

$$D(b, a) = \frac{1}{c_\psi \sqrt{a}} (W_\psi f)(b, a).$$

However, this is not generally the case. Indeed, there is no particular
reason why the physically meaningful function $D(b, a)$ must have the form
$(W_\psi g)(b, a)$ for some function $g$. So the inverse problem of estimating
$D(b, a)$ from $f(t)$ is more complicated. One approach mentioned in [97]
involves transmitting more than one signal $\psi(t)$ and estimating $D(b, a)$
from the echoes corresponding to each of the several different transmitted
signals.


15.10 The Narrowband Cross-Ambiguity Function 


The real signal $\psi(t)$ with Fourier transform $\Psi(\omega)$ is said to be a
narrowband signal if there are constants $\alpha$ and $\gamma$ such that the conjugate-
symmetric function $\Psi(\omega)$ is concentrated on $\alpha \le |\omega| \le \gamma$ and $\frac{\gamma - \alpha}{\gamma + \alpha}$ is nearly
equal to zero, which means that $\alpha$ is very much greater than $\beta = \frac{\gamma - \alpha}{2}$. The
center frequency is $\omega_c = \frac{\gamma + \alpha}{2}$.


Ex. 15.6 Let $\phi = 2\omega_c v/c$. Show that $a\omega_c$ is approximately equal to $\omega_c + \phi$.


It follows then that, for $\omega > 0$, $F(\omega)$, the Fourier transform of the echo
$f(t)$, is approximately $Aae^{ib\omega}\Psi(\omega + \phi)$. Because the Doppler shift affects
positive and negative frequencies differently, it is convenient to construct a
related signal having only positive frequency components.

Let $G(\omega) = 2F(\omega)$ for $\omega > 0$ and $G(\omega) = 0$ otherwise. Let $g(t)$ be
the inverse Fourier transform of $G(\omega)$. Then, the complex-valued function
$g(t)$ is called the analytic signal associated with $f(t)$. The function $f(t)$ is
the real part of $g(t)$; the imaginary part of $g(t)$ is the Hilbert transform of
$f(t)$. Then, the demodulated analytic signal associated with $f(t)$ is $h(t)$ with
Fourier transform $H(\omega) = G(\omega + \omega_c)$. Similarly, let $\gamma(t)$ be the demodulated
analytic signal associated with $\psi(t)$.


Ex. 15.7 Show that the demodulated analytic signals $h(t)$ and $\gamma(t)$ are
related by

$$h(t) = Be^{i\phi t}\gamma(t - b) = B\gamma_{\phi,b}(t),$$

for $B$ a time-independent constant. Hint: Use the fact that $\Psi(\omega) = 0$ for
$0 \le \omega < \alpha$ and $\phi < \alpha$.


To determine the range and radial velocity in the narrowband case
we again use the matched filter, forming the narrowband cross-ambiguity
function (NCAF)

$$N_h(\phi, b) = \langle h, \gamma_{\phi,b} \rangle = \int_{-\infty}^{\infty} h(t)\, \overline{\gamma(t - b)}\, e^{-i\phi t} dt.$$

Ideally, the values of $\phi$ and $b$ corresponding to the largest absolute value of
$N_h(\phi, b)$ will be the true ones, from which the range and radial velocity can
be determined. For each fixed value of $b$, the NCAF is the Fourier transform
of the function $h(t)\overline{\gamma(t - b)}$, evaluated at $\omega = -\phi$; so the NCAF contains
complete information about the function $h(t)$. In Chapter 16 on wavelets
we shall consider the NCAF in a different light, with $\gamma$ playing the role of a
window function and the NCAF the short-time Fourier transform of $h(t)$,
describing the frequency content of $h(t)$ near the time $b$.

In the more general case in which the narrowband echo function $f(t)$ is
a superposition of narrowband reflections,

$$f(t) = \int_{-\infty}^{\infty} \int_0^{\infty} D(b, a) \psi_{a,b}(t) \frac{da\, db}{a^2},$$

we have

$$h(t) = \int_{-\infty}^{\infty} \int_0^{\infty} D_{NB}(b, \phi)\, e^{i\phi t}\, \gamma(t - b)\, d\phi\, db,$$

where $D_{NB}(b, \phi)$ is the narrowband distribution of reflecting target points,
as a function of $b$ and $\phi = 2\omega_c v/c$. The inverse problem now is to estimate
this distribution, given $h(t)$.


15.11 Range Estimation 


If the transmitted signal is $\psi(t) = e^{i\omega t}$ and the target is stationary
at range $r$, then the echo received is $f(t) = Ae^{i\omega(t - b)}$, where $b = 2r/c$.
So our information about $r$ is that we know the value $e^{2i\omega r/c}$. Because
of the periodicity of the complex exponential function, this is not enough
information to determine $r$; we need $e^{2i\omega r/c}$ for a variety of values of $\omega$. To
obtain these values we can transmit a signal whose frequency changes with
time, such as a chirp of the form

$$\psi(t) = e^{i\omega t^2}$$

with the frequency $2\omega t$ at time $t$.
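One common way to use a chirp for range estimation is to correlate the received echo against the transmitted pulse and look for the lag of the peak. The Python sketch below illustrates that idea under several assumptions that are not taken from the text: a finite-duration sampled chirp, a noise-free attenuated echo, and toy values for the sample spacing, the chirp rate and the delay.

```python
import cmath

DT = 0.01          # hypothetical sample spacing (toy time units)
N_PULSE = 200      # number of samples in the transmitted chirp
OMEGA = 20.0       # chirp parameter: instantaneous frequency is 2 * OMEGA * t

# Transmitted chirp exp(i * OMEGA * t^2), sampled at t = n * DT.
pulse = [cmath.exp(1j * OMEGA * (n * DT) ** 2) for n in range(N_PULSE)]

# Noise-free echo: an attenuated copy of the pulse delayed by true_lag samples.
true_lag = 73
echo = [0j] * 400
for n, p in enumerate(pulse):
    echo[true_lag + n] = 0.5 * p

def correlate(lag):
    """Magnitude of the correlation of the echo with the pulse shifted by lag."""
    return abs(sum(echo[lag + n] * pulse[n].conjugate()
                   for n in range(N_PULSE)))

# The lag with the largest correlation estimates the round-trip delay b.
best_lag = max(range(len(echo) - N_PULSE + 1), key=correlate)
estimated_delay = best_lag * DT

C_LIGHT = 3.0e8                                   # speed of light (rounded)
estimated_range = C_LIGHT * estimated_delay / 2   # r = c * b / 2
```

At the correct lag every product in the sum has the same phase, so the terms add coherently; at any other lag fewer samples overlap and their phases disagree, which is what makes the peak sharp.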
15.12 Time-Frequency Analysis


The inverse Fourier transform formula

$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega) e^{-i\omega t} d\omega$$

provides a representation of the function of time $f(t)$ as a superposition of
sinusoids $e^{-i\omega t}$ with frequencies $\omega$. The value at $\omega$ of the Fourier transform

$$F(\omega) = \int_{-\infty}^{\infty} f(t) e^{i\omega t} dt$$

is the complex amplitude associated with the sinusoidal component $e^{-i\omega t}$.
It quantifies the contribution to $f(t)$ made by that sinusoid, over all of $t$. To
determine each individual number $F(\omega)$ we need $f(t)$ for all $t$. It is implicit
that the frequency content has not changed over time.


15.13 The Short-Time Fourier Transform 


To estimate the frequency content of the signal $f(t)$ around the time
$t = b$, we could proceed as follows. Multiply $f(t)$ by the function that is
equal to $\frac{1}{2\epsilon}$ on the interval $[b - \epsilon, b + \epsilon]$ and zero otherwise. Then take the
Fourier transform. The multiplication step is called windowing.

To see how well this works, consider the case in which $f(t) = \exp(-i\omega_0 t)$
for all $t$. The Fourier transform of the windowed signal is then

$$\exp(i(\omega - \omega_0)b) \frac{\sin(\epsilon(\omega - \omega_0))}{\epsilon(\omega - \omega_0)}.$$

The magnitude of this function attains its maximum value of one at $\omega = \omega_0$. But, the first
zeros of the function are at $|\omega - \omega_0| = \frac{\pi}{\epsilon}$, which says that as $\epsilon$ gets smaller
the windowed Fourier transform spreads out more and more around $\omega =
\omega_0$; that is, better time localization comes at the price of worse frequency
localization. To achieve a somewhat better result we can change the window
function.

The standard normal (or Gaussian) curve is

$$g(t) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} t^2 \right),$$

which has its peak at $t = 0$ and falls off to zero symmetrically on either
side. For $\sigma > 0$, let

$$g_\sigma(t) = \frac{1}{\sigma} g(t/\sigma).$$

Then the function $g_\sigma(t - b)$ is centered at $t = b$ and falls off on either side,
more slowly for large $\sigma$, faster for smaller $\sigma$. Also we have

$$\int_{-\infty}^{\infty} g_\sigma(t - b) dt = 1$$

for each $b$ and $\sigma > 0$. Such functions were used by Gabor [79] for windowing
signals and are called Gabor windows.

Gabor's idea was to multiply $f(t)$, the signal of interest, by the window
$g_\sigma(t - b)$ and then to take the Fourier transform, obtaining the short-time
Fourier transform (STFT)

$$G_b^\sigma(\omega) = \int_{-\infty}^{\infty} f(t) g_\sigma(t - b) e^{i\omega t} dt.$$

Since $g_\sigma(t - b)$ falls off to zero on either side of $t = b$, multiplying by
this window essentially restricts the signal to a neighborhood of $t = b$.
The STFT then measures the frequency content of the signal, near the
time $t = b$. The STFT therefore performs a time-frequency analysis of the
signal.

We focus more tightly around the time $t = b$ by choosing a small value
for $\sigma$. Because of the uncertainty principle, the Fourier transform of the
window $g_\sigma(t - b)$ grows wider as $\sigma$ gets smaller; the time-frequency window
remains constant [55]. This causes the STFT to involve greater blurring
in the frequency domain. In short, to get good resolution in frequency, we 
need to observe for a longer time; if we focus on a small time interval, we 
pay the price of reduced frequency resolution. This is unfortunate because 
when we focus on a short interval of time, it is to uncover a part of the signal 
that is changing within that short interval, which means it must have high 
frequency components within that interval. There is no reason to believe 
that the spacing is larger between those high frequencies we wish to resolve 
than between lower frequencies associated with longer time intervals. We 
would like to have the same resolving capability when focusing on a short 
time interval that we have when focusing on a longer one. 
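The trade-off just described can be seen in a small numerical experiment. The Python sketch below approximates the Gaussian-windowed transform $\int f(t) g_\sigma(t - b) e^{i\omega t} dt$ by a Riemann sum over $|t - b| \le 6\sigma$. The test signal, the frequency grid, the step size and the function names are made up for illustration. The signal has frequency 2 before $t = 0$ and frequency 4 afterwards, in the sign convention used above, where $\exp(-i\omega_0 t)$ produces a peak at $\omega = \omega_0$.

```python
import cmath
import math

def gauss_window(t, sigma):
    """Gaussian window g_sigma(t) with unit integral."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def signal(t):
    """Hypothetical non-stationary signal: frequency 2 for t < 0, then 4."""
    return cmath.exp(-2j * t) if t < 0 else cmath.exp(-4j * t)

def stft(f, b, sigma, omega, dt=0.01):
    """Riemann-sum approximation of the Gaussian-windowed transform at (b, omega)."""
    n = int(round(6 * sigma / dt))   # integrate over |t - b| <= 6 sigma
    total = 0j
    for k in range(-n, n + 1):
        t = b + k * dt
        total += f(t) * gauss_window(t - b, sigma) * cmath.exp(1j * omega * t)
    return total * dt

OMEGAS = [0.25 * k for k in range(25)]   # frequency grid 0, 0.25, ..., 6

def peak_frequency(b, sigma):
    """Grid frequency at which the magnitude of the STFT is largest."""
    return max(OMEGAS, key=lambda w: abs(stft(signal, b, sigma, w)))

early = peak_frequency(-5.0, 1.0)    # frequency content near t = -5
late = peak_frequency(5.0, 1.0)      # frequency content near t = +5

# Response one unit away from the true frequency 4, for two window widths.
wide_window = abs(stft(signal, 5.0, 1.0, 5.0))
narrow_window = abs(stft(signal, 5.0, 0.25, 5.0))
```

For a pure tone the magnitude of the transform is $\exp(-\sigma^2(\omega - \omega_0)^2/2)$. The narrow window ($\sigma = 0.25$) therefore responds almost as strongly at the wrong frequency as at the right one, while the wide window ($\sigma = 1$) separates the two much better, at the cost of poorer time localization.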


15.14 The Wigner—Ville Distribution 


In [118] Meyer describes Ville's approach to determining the
instantaneous power spectrum of the signal, that is, the energy in the signal $f(t)$
that corresponds to time $t$ and frequency $\omega$. The goal is to find a function
$W_f(t, \omega)$ having the properties

$$\int W_f(t, \omega) \frac{d\omega}{2\pi} = |f(t)|^2,$$

which is the total energy in the signal at time $t$, and

$$\int W_f(t, \omega) dt = |F(\omega)|^2,$$

which is the total energy in the Fourier transform at frequency $\omega$. Because
these two properties do not specify a unique $W_f(t, \omega)$, two additional
properties are usually required:

$$\int \int W_f(t, \omega) W_g(t, \omega) dt \frac{d\omega}{2\pi} = \left| \int f(t) \overline{g(t)} dt \right|^2$$

and, for $f(t) = g_\sigma(t - b) \exp(i\alpha t)$,

$$W_f(t, \omega) = 2 \exp(-\sigma^{-2}(t - b)^2) \exp(-\sigma^2(\omega - \alpha)^2).$$

The Wigner-Ville distribution of $f(t)$, given by

$$WV_f(t, \omega) = \int_{-\infty}^{\infty} f\left( t + \frac{\tau}{2} \right) \overline{f\left( t - \frac{\tau}{2} \right)} \exp(-i\omega\tau) d\tau,$$

has all four of the desired properties. The Wigner-Ville distribution is
always real-valued, but its values need not be nonnegative.

In [65] De Bruijn defines the score of a signal f(t) to be $H(x,y;f,f)$,
where


$$H(x,y;f_1,f_2) = 2\int_{-\infty}^{\infty} f_1(x+t)\,\overline{f_2(x-t)}\, e^{-4\pi i y t}\, dt.$$


Ex. 15.8 Relate the narrowband cross-ambiguity function to De
Bruijn's score and the Wigner-Ville distribution.


16.1 Chapter Summary


In this chapter we present an overview of the theory of wavelets, with
particular emphasis on their use in signal processing.


16.2 Background 


The fantastic increase in computer power over the last few decades 
has made possible, even routine, the use of digital procedures for solving 
problems that were believed earlier to be intractable, such as the modeling 
of large-scale systems. At the same time, it has created new applications 
unimagined previously, such as medical imaging. In some cases the mathematical
formulation of the problem is known and progress has come with
the introduction of efficient computational algorithms, as with the Fast 
Fourier Transform. In other cases, the mathematics is developed, or per- 
haps rediscovered, as needed by the people involved in the applications. 
Only later is it realized that the theory already existed, as with
computerized tomography, which was developed without knowledge of Radon's
earlier work on reconstruction of functions from their line integrals.

It can happen that applications give a theoretical field of mathematics 
a rebirth; such seems to be the case with wavelets [95]. Sometime in the 
1980s researchers working on various problems in electrical engineering, 
quantum mechanics, image processing, and other areas became aware that 
what the others were doing was related to their own work. As connections 
became established, similarities with the earlier mathematical theory of 
approximation in functional analysis were noticed. Meetings began to take 
place, and a common language began to emerge around this reborn area, 
now called wavelets. One of the most significant meetings took place in June 
of 1990, at the University of Massachusetts Lowell. The keynote speaker 
was Ingrid Daubechies; the lectures she gave that week were subsequently 
published in the book [64]. 

There are a number of good books on wavelets, such as [97], [11], and 
[159]. A recent issue of the IEEE Signal Processing Magazine has an inter- 
esting article on using wavelet analysis of paintings for artist identification 
[96]. 

Fourier analysis and synthesis concern the decomposition, filtering,
compressing, and reconstruction of signals using complex exponential func- 
tions as the building blocks; wavelet theory provides a framework in which 
other building blocks, better suited to the problem at hand, can be used. As 
always, efficient algorithms provide the bridge between theory and practice. 

Since their development in the 1980s wavelets have been used for many 
purposes. In the discussion to follow, we focus on the problem of analyzing a 
signal whose frequency composition is changing over time. As we saw in our 
discussion of the narrowband cross-ambiguity function in radar, the need 
for such time-frequency analysis has been known for quite a while. Other 
methods, such as Gabor's short-time Fourier transform and the Wigner-Ville
distribution, have also been considered for this purpose.


16.3 A Simple Example 


Imagine that f(t) is defined for all real t and we have sampled f(t) every 
half-second. We focus on the time interval [0,2). Suppose that f(0) = 1,
f(0.5) = -3, f(1) = 2 and f(1.5) = 4. We approximate f(t) within the
interval [0,2) by replacing f(t) with the step function that is 1 on [0,0.5),
-3 on [0.5,1), 2 on [1,1.5), and 4 on [1.5,2); for notational convenience, we
represent this step function by (1,-3,2,4). We can decompose (1,-3,2,4)
into a sum of step functions


$$(1,-3,2,4) = 1(1,1,1,1) - 2(1,1,-1,-1) + 2(1,-1,0,0) - 1(0,0,1,-1).$$


The first basis element, (1,1,1,1), does not vary over a two-second interval.
The second one, (1,1,-1,-1), is orthogonal to the first, and does not vary
over a one-second interval. The other two, both orthogonal to the previous 
two and to each other, vary over half-second intervals. We can think of these 
basis functions as corresponding to different frequency components and 
time locations; that is, they are giving us a time-frequency decomposition. 
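
The decomposition of the step function (1,-3,2,4) can be checked numerically. The following is a minimal sketch, assuming only that the four step functions are mutually orthogonal, so that each coefficient is an inner product divided by a squared norm; the helper names `dot`, `coefficients` and `synthesize` are ours and purely illustrative.

```python
# The four orthogonal step functions on the half-second cells of [0, 2).
basis = [
    (1, 1, 1, 1),    # constant over the whole two-second interval
    (1, 1, -1, -1),  # constant over each one-second half
    (1, -1, 0, 0),   # varies over half-second cells in [0, 1)
    (0, 0, 1, -1),   # varies over half-second cells in [1, 2)
]

def dot(u, v):
    """Inner product of two step functions given by their four values."""
    return sum(a * b for a, b in zip(u, v))

def coefficients(samples):
    """Coefficient of each basis element: <samples, b> / <b, b>."""
    return [dot(samples, b) / dot(b, b) for b in basis]

def synthesize(coeffs):
    """Rebuild the step function from its coefficients."""
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(4)]

samples = (1, -3, 2, 4)
coeffs = coefficients(samples)
rebuilt = synthesize(coeffs)
print(coeffs)
print(rebuilt)
```

The computed coefficients are the multipliers 1, -2, 2 and -1 of the four step functions, and summing the weighted step functions returns the original four values.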

Suppose we let $\phi_0(t)$ be the function that has the value 1 on the interval
[0,1) and zero elsewhere, and $\psi_0(t)$ the function that has the value 1 on the
interval [0,0.5), the value -1 on the interval [0.5,1), and zero elsewhere.
Then we say that

$$\phi_0(t) = (1,1,0,0),$$


and

$$\psi_0(t) = (1,-1,0,0).$$

We write

$$\phi_{-1}(t) = (1,1,1,1) = \phi_0(0.5t) = \phi_0(2^{-1}t),$$

$$\psi_0(t-1) = (0,0,1,-1),$$

and


$$\psi_{-1}(t) = (1,1,-1,-1) = \psi_0(0.5t) = \psi_0(2^{-1}t).$$


So we have the decomposition of (1,-3,2,4) as


$$(1,-3,2,4) = 1\phi_{-1}(t) - 2\psi_{-1}(t) + 2\psi_0(t) - 1\psi_0(t-1).$$


In what follows we shall be interested in extending these ideas, to find other
functions $\phi_0(t)$ and $\psi_0(t)$ that lead to bases consisting of functions of the
form


$$\psi_{j,k}(t) = \psi_0(2^j t - k).$$


These will be our wavelet bases.


16.4 The Integral Wavelet Transform


For real numbers b and $a \neq 0$, the integral wavelet transform (IWT) of
the signal f(t) relative to the basic wavelet (or mother wavelet) $\psi(t)$ is


$$(W_\psi f)(b,a) = |a|^{-\frac{1}{2}} \int_{-\infty}^{\infty} f(t)\, \overline{\psi\left(\frac{t-b}{a}\right)}\, dt.$$


This function is also the wideband cross-ambiguity function in radar. The
function $\psi(t)$ is also called a window function and, like Gaussian functions,
it will be relatively localized in time. However, it must also have properties
quite different from those of Gabor's Gaussian windows; in particular, we
want

$$\int_{-\infty}^{\infty} \psi(t)\, dt = 0.$$


An example is the Haar wavelet $\psi_{Haar}(t)$ that has the value +1 for $0 \le t < \frac{1}{2}$,
-1 for $\frac{1}{2} \le t < 1$, and 0 otherwise.

As the scaling parameter a grows larger the scaled wavelet $\psi((t-b)/a)$ grows wider,
so choosing a small value of the scaling parameter permits us to focus on a 
neighborhood of the time t = b. The IWT then registers the contribution 
to f(t) made by components with features on the scale determined by 
a, in the neighborhood of t = b. Calculations involving the uncertainty 
principle reveal that the IWT provides a flexible time-frequency window 
that narrows when we observe high frequency components and widens for 
lower frequencies [55]. 
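
To make the localization concrete, the IWT with respect to the Haar wavelet can be approximated by a simple quadrature. The sketch below is only illustrative: it assumes a > 0, uses a midpoint rule, and applies the normalization $|a|^{-1/2}$; the function names `haar`, `iwt_haar` and `step` are ours. Since the Haar wavelet is real, complex conjugation plays no role here.

```python
def haar(t):
    """Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 otherwise."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def iwt_haar(f, b, a, n=2000):
    """Midpoint-rule approximation of a**(-1/2) * integral of
    f(t) * haar((t - b) / a) dt, for a > 0. The integrand vanishes
    outside the interval [b, b + a], so we integrate only there."""
    h = a / n
    total = 0.0
    for i in range(n):
        t = b + (i + 0.5) * h
        total += f(t) * haar((t - b) / a)
    return total * h / a ** 0.5

def step(t):
    """A signal with a single jump at t = 1."""
    return 1.0 if t >= 1.0 else 0.0

far_left = iwt_haar(step, 0.0, 0.5)   # window [0, 0.5]: signal is constant
at_jump = iwt_haar(step, 0.75, 0.5)   # window [0.75, 1.25] straddles the jump
far_right = iwt_haar(step, 2.0, 0.5)  # window [2, 2.5]: signal is constant
print(far_left, at_jump, far_right)
```

Because the wavelet integrates to zero, the transform vanishes wherever the signal is constant over the window and responds only when the window contains the jump.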

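Given the integral wavelet transform $(W_\psi f)(b,a)$, it is natural to ask
how we might recover the signal f(t). The following inversion formula answers
that question: at points t where f(t) is continuous we have


$$f(t) = \frac{1}{C_\psi} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (W_\psi f)(b,a)\, |a|^{-\frac{1}{2}}\, \psi\left(\frac{t-b}{a}\right) \frac{da}{a^2}\, db,$$


with

$$C_\psi = \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^2}{|\omega|}\, d\omega,$$

for $\Psi(\omega)$, the Fourier transform of $\psi(t)$.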


16.5 Wavelet Series Expansions 


The Fourier series expansion of a function f(t) on a finite interval is 
a representation of f(t) as a sum of orthogonal complex exponentials. Localized
alterations in f(t) affect every one of the components of this sum.
Wavelets, on the other hand, can be used to represent f(t) so that local- 
ized alterations in f(t) affect only a few of the components of the wavelet 
expansion. The simplest example of a wavelet expansion is with respect to 
the Haar wavelets. 


Ex. 16.1 Let $\psi(t) = \psi_{Haar}(t)$. Show that the functions $\psi_{jk}(t) = \psi(2^j t - k)$
are mutually orthogonal on the interval [0,1], where j = 0,1,... and k =
$0,1,...,2^j - 1$.


These functions $\psi_{jk}(t)$ are the Haar wavelets. Every continuous function
f(t) defined on [0,1] can be written as


$$f(t) = c_0 + \sum_{j=0}^{\infty} \sum_{k=0}^{2^j - 1} c_{jk}\, \psi_{jk}(t)$$


for some choice of $c_0$ and $c_{jk}$. Notice that the support of the function $\psi_{jk}(t)$,
the interval on which it is nonzero, gets smaller as j increases. Therefore,
the components corresponding to higher values of j in the Haar expansion
of f(t) come from features that are localized in the variable t; such features 
are transients that live for only a short time. Such transient components 
affect all of the Fourier coefficients but only those Haar wavelet coefficients 
corresponding to terms supported in the region of the disturbance. This 
ability to isolate localized features is the main reason for the popularity of 
wavelet expansions. 
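
The contrast between Fourier and Haar coefficients can be seen in a small discrete experiment. The sketch below is an illustration under our own assumptions: the signal is replaced by 64 samples, the Haar analysis is done by repeated pairwise averages and differences (an unnormalized discrete analogue of the expansion above), and the Fourier coefficients are replaced by a naive discrete Fourier transform. All function names are hypothetical.

```python
import cmath

def haar_transform(samples):
    """Unnormalized discrete Haar analysis of 2**J samples. Returns the
    overall mean followed by the detail coefficients, coarsest level first."""
    approx = list(samples)
    levels = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.append([(u - v) / 2 for u, v in pairs])  # local differences
        approx = [(u + v) / 2 for u, v in pairs]        # local averages
    return [approx[0]] + [d for level in reversed(levels) for d in level]

def dft(x):
    """Naive discrete Fourier transform, adequate for short sequences."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def count_changed(u, v, tol=1e-12):
    return sum(1 for a, b in zip(u, v) if abs(a - b) > tol)

n = 64
smooth = [(i / n) * (1 - i / n) for i in range(n)]
bumped = list(smooth)
bumped[40] += 1.0  # a transient confined to a single sample

changed_haar = count_changed(haar_transform(smooth), haar_transform(bumped))
changed_dft = count_changed(dft(smooth), dft(bumped))
print(changed_haar, "Haar coefficients changed;", changed_dft, "DFT coefficients changed")
```

In this experiment the transient alters only the mean and one detail coefficient per level, while every discrete Fourier coefficient is altered.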

The orthogonal functions used in the Haar wavelet expansion are them- 
selves discontinuous, which presents a bit of a problem when we represent 
continuous functions. Wavelets that are themselves continuous, or better 
still, differentiable, should do a better job representing smooth functions. 

We can obtain other wavelet series expansions by selecting a basic
wavelet $\psi(t)$ and defining $\psi_{jk}(t) = 2^{j/2}\psi(2^j t - k)$, for integers j and k.
We then say that the function $\psi(t)$ is an orthogonal wavelet if the family
$\{\psi_{jk}\}$ is an orthonormal basis for the space of square-integrable functions
on the real line, the Hilbert space $L^2(\mathbb{R})$. This implies that for every such
f(t) there are coefficients $c_{jk}$ so that


$$f(t) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_{jk}\, \psi_{jk}(t),$$


with convergence in the mean-square sense. The coefficients $c_{jk}$ are found
using the IWT:


$$c_{jk} = (W_\psi f)\left(\frac{k}{2^j}, \frac{1}{2^j}\right).$$

It is also of interest to consider wavelets $\psi$ for which $\{\psi_{jk}\}$ form a basis,
but not an orthogonal one, or, more generally, form a frame, in which the
series representations of f(t) need not be unique.

As with Fourier series, wavelet series expansion permits the filtering of 
certain components, as well as signal compression. In the case of Fourier 
series, we might attribute high frequency components to noise and achieve 
a smoothing by setting to zero the coefficients associated with these high 
frequencies. In the case of wavelet series expansions, we might attribute to 
noise localized small-scale disturbances and remove them by setting to zero 
the coefficients corresponding to the appropriate j and k. For both Fourier 
and wavelet series expansions we can achieve compression by ignoring those 
components whose coefficients are below some chosen level. 


16.6 Multiresolution Analysis 


One way to study wavelet series expansions is through multiresolution 
analysis (MRA) [115]. Let us begin with an example involving band-limited 
functions. This example is called the Shannon MRA. 


16.6.1 The Shannon Multiresolution Analysis 


Let $V_0$ be the collection of functions f(t) whose Fourier transform $F(\omega)$
is zero for $|\omega| > \pi$; so $V_0$ is the collection of $\pi$-band-limited functions.
Let $V_1$ be the collection of functions f(t) whose Fourier transform $F(\omega)$ is
zero for $|\omega| > 2\pi$; so $V_1$ is the collection of $2\pi$-band-limited functions. In
general, for each integer j, let $V_j$ be the collection of functions f(t) whose
Fourier transform $F(\omega)$ is zero for $|\omega| > 2^j\pi$; so $V_j$ is the collection of
$2^j\pi$-band-limited functions.


Ex. 16.2 Show that if the function f(t) is in $V_j$ then the function g(t) =
f(2t) is in $V_{j+1}$.


We then have a nested sequence of sets of functions $\{V_j\}$, with $V_j \subset V_{j+1}$
for each integer j. The intersection of all the $V_j$ is the set containing only
the zero function. Every function in $L^2(\mathbb{R})$ is arbitrarily close to a function
in at least one of the sets $V_j$; more mathematically, we say that the union
of the $V_j$ is dense in $L^2(\mathbb{R})$. In addition, we have f(t) in $V_j$ if and only if
g(t) = f(2t) is in $V_{j+1}$. In general, such a collection of sets of functions
is called a multiresolution analysis for $L^2(\mathbb{R})$. Once we have a MRA for
$L^2(\mathbb{R})$, how do we get a wavelet series expansion?

A function $\phi(t)$ is called a scaling function or sometimes the father
wavelet for the MRA if the collection of integer translates $\{\phi(t - k)\}$ forms
a basis for $V_0$ (more precisely, a Riesz basis). Then, for each fixed j, the
functions $\phi_{jk}(t) = \phi(2^j t - k)$, for integer k, will form a basis for $V_j$. In the
case of the Shannon MRA, the scaling function is $\phi(t) = \frac{\sin \pi t}{\pi t}$. But how
do we get a basis for all of $L^2(\mathbb{R})$?


16.6.2 The Haar Multiresolution Analysis 


To see how to proceed, it is helpful to return to the Haar wavelets. Let
$\phi_{Haar}(t)$ be the function that has the value +1 for $0 \le t < 1$ and zero
elsewhere. Let $V_0$ be the collection of all functions in $L^2(\mathbb{R})$ that are linear
combinations of integer translates of $\phi_{Haar}(t)$; that is, all functions f(t) that
are constant on intervals of the form [k, k+1), for all integers k. Now $V_1$ is
the collection of all functions g(t) of the form g(t) = f(2t), for some f(t)
in $V_0$. Therefore, $V_1$ consists of all functions in $L^2(\mathbb{R})$ that are constant on
intervals of the form [k/2, (k+1)/2).

Every function in $V_0$ is also in $V_1$ and every function g(t) in $V_1$ can be
written uniquely as a sum of a function f(t) in $V_0$ and a function h(t) in
$V_1$ that is orthogonal to every function in $V_0$. For example, the function
g(t) that takes the value +3 for $0 \le t < 1/2$, -1 for $1/2 \le t < 1$, and zero
elsewhere can be written as g(t) = f(t) + h(t), where h(t) has the value +2
for $0 \le t < 1/2$, -2 for $1/2 \le t < 1$, and zero elsewhere, and f(t) takes the
value +1 for $0 \le t < 1$ and zero elsewhere. Clearly, h(t), which is twice the
Haar wavelet function, is orthogonal to all functions in $V_0$.


Ex. 16.3 Show that the function f(t) can be written uniquely as f(t) =
d(t) + e(t), where d(t) is in $V_{-1}$ and e(t) is in $V_0$ and is orthogonal to every
function in $V_{-1}$. Relate the function e(t) to the Haar wavelet function.


16.6.3 Wavelets and Multiresolution Analysis 


To get an orthogonal wavelet expansion from a general MRA, we write
the set $V_1$ as the direct sum $V_1 = V_0 \oplus W_0$, so every function g(t) in $V_1$
can be uniquely written as g(t) = f(t) + h(t), where f(t) is a function in
$V_0$ and h(t) is a function in $W_0$, with f(t) and h(t) orthogonal. Since the
scaling function or father wavelet $\phi(t)$ is in $V_1$, it can be written as


$$\phi(t) = \sum_{k=-\infty}^{\infty} p_k\, \phi(2t - k), \qquad (16.1)$$


for some sequence $\{p_k\}$ called the two-scale sequence for $\phi(t)$. This most
important identity is the scaling relation for the father wavelet. The mother
wavelet is defined using a similar expression


$$\psi(t) = \sum_k (-1)^k p_{1-k}\, \phi(2t - k). \qquad (16.2)$$

We define

$$\phi_{jk}(t) = 2^{j/2}\phi(2^j t - k) \qquad (16.3)$$

and

$$\psi_{jk}(t) = 2^{j/2}\psi(2^j t - k). \qquad (16.4)$$


The collection $\{\psi_{jk}(t), -\infty < j,k < \infty\}$ then forms an orthogonal wavelet
basis for $L^2(\mathbb{R})$. For the Haar MRA, the two-scale sequence is $p_0 = p_1 = 1$
and $p_k = 0$ for the rest.


Ex. 16.4 Show that the two-scale sequence $\{p_k\}$ has the properties


$$p_k = 2\int \phi(t)\,\phi(2t - k)\, dt;$$


$$\sum_{k=-\infty}^{\infty} p_{k-2m}\, p_k = 0,$$


for $m \neq 0$ and equals 2 when m = 0.


16.7 Signal Processing Using Wavelets 


Once we have an orthogonal wavelet basis for $L^2(\mathbb{R})$, we can use the
basis to represent and process a signal f(t). Suppose, for example, that f(t)
is band-limited but essentially zero for t not in [0,1] and we have samples
$f(k/M)$, k = 0,...,M. We assume that the sampling rate $\Delta = \frac{1}{M}$ is faster
than the Nyquist rate so that the Fourier transform of f(t) is zero outside,
say, the interval $[0, 2\pi M]$. Roughly speaking, the $W_j$ component of f(t),
given by


$$g_j(t) = \sum_{k=0}^{2^j - 1} b^j_k\, \psi_{jk}(t),$$


with $b^j_k = \langle f(t), \psi_{jk}(t)\rangle$, corresponds to the components of f(t) with
frequencies $\omega$ between $2^{j-1}$ and $2^j$. For $2^j > 2\pi M$ we have $b^j_k = 0$, so
$g_j(t) = 0$. Let J be the smallest integer greater than $\log_2(2\pi) + \log_2(M)$.
Then, f(t) is in the space $V_J$ and has the expansion


$$f(t) = \sum_{k=0}^{2^J - 1} a^J_k\, \phi_{Jk}(t),$$

for $a^J_k = \langle f(t), \phi_{Jk}(t)\rangle$. It is common practice, but not universally
approved, to take $M = 2^J$ and to estimate the $a^J_k$ by the samples $f(k/M)$.
Once we have the sequence $\{a^J_k\}$, we can begin the decomposition of f(t)
into components in $V_j$ and $W_j$ for j < J. As we shall see, the algorithms
for the decomposition and subsequent reconstruction of the signal are quite
similar to the FFT.


16.7.1 Decomposition and Reconstruction 


The decomposition and reconstruction algorithms both involve the
equation


$$\sum_k a^j_k\, \phi_{jk} = \sum_m \left( a^{j-1}_m\, \phi_{(j-1),m} + b^{j-1}_m\, \psi_{(j-1),m} \right); \qquad (16.5)$$


in the decomposition step we know the $\{a^j_k\}$ and want the $\{a^{j-1}_k\}$ and
$\{b^{j-1}_k\}$, while in the reconstruction step we know the $\{a^{j-1}_k\}$ and $\{b^{j-1}_k\}$
and want the $\{a^j_k\}$.


Using Equations (16.1) and (16.3), we obtain


$$\phi_{(j-1),l} = 2^{-1/2} \sum_k p_k\, \phi_{j,(k+2l)} = 2^{-1/2} \sum_k p_{k-2l}\, \phi_{jk}; \qquad (16.6)$$


using Equations (16.2), (16.3) and (16.4), we get

$$\psi_{(j-1),l} = 2^{-1/2} \sum_k (-1)^k p_{1-k+2l}\, \phi_{jk}. \qquad (16.7)$$


Therefore,


$$\langle \phi_{jk}, \phi_{(j-1),l} \rangle = 2^{-1/2}\, p_{k-2l}; \qquad (16.8)$$


this comes from substituting $\phi_{(j-1),l}$ as in Equation (16.6) into the second
term in the inner product. Similarly, we have


$$\langle \phi_{jk}, \psi_{(j-1),l} \rangle = 2^{-1/2} (-1)^k p_{1-k+2l}. \qquad (16.9)$$


These relationships are then used to derive the decomposition and reconstruction
algorithms.


16.7.1.1 The Decomposition Step


To find $a^{j-1}_l$ we take the inner product of both sides of Equation (16.5)
with the function $\phi_{(j-1),l}$. Using Equation (16.8) and the fact that $\phi_{(j-1),l}$
is orthogonal to all the $\phi_{(j-1),m}$ except for m = l and is orthogonal to all
the $\psi_{(j-1),m}$, we obtain


$$2^{-1/2} \sum_k a^j_k\, p_{k-2l} = a^{j-1}_l;$$


similarly, using Equation (16.9), we get


$$2^{-1/2} \sum_k a^j_k\, (-1)^k p_{1-k+2l} = b^{j-1}_l.$$


The decomposition step is to apply these two equations to get the $\{a^{j-1}_l\}$
and $\{b^{j-1}_l\}$ from the $\{a^j_k\}$.


16.7.1.2 The Reconstruction Step 


Now we use Equations (16.6) and (16.7) to substitute into the right
hand side of Equation (16.5). Combining terms, we get


$$a^j_k = 2^{-1/2} \sum_l \left( a^{j-1}_l\, p_{k-2l} + b^{j-1}_l\, (-1)^k p_{1-k+2l} \right).$$


This takes us from the $\{a^{j-1}_l\}$ and $\{b^{j-1}_l\}$ to the $\{a^j_k\}$.
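
The decomposition and reconstruction steps can be sketched in a few lines. This is a minimal illustration, not a production algorithm: it assumes a real two-scale sequence, stored as a list `p` with `p[m]` equal to $p_m$, and it treats the finite data as periodic (indices wrap around), which is our own assumption for handling the ends. The Haar sequence $p_0 = p_1 = 1$ and the four-term Daubechies sequence obtained in Section 16.9 serve as test cases.

```python
import math

S = 1.0 / math.sqrt(2.0)  # the factor 2^{-1/2}

def decompose(a, p):
    """One decomposition step: level-j coefficients a -> (coarse, detail)
    at level j-1. Indices wrap around (periodic extension, an assumption)."""
    n = len(a)
    coarse, detail = [], []
    for l in range(n // 2):
        c = d = 0.0
        for m, pm in enumerate(p):
            # coarse_l = 2^{-1/2} sum_k a_k p_{k-2l}, with k = m + 2l
            c += pm * a[(m + 2 * l) % n]
            # detail_l = 2^{-1/2} sum_k a_k (-1)^k p_{1-k+2l}, with k = 1 - m + 2l
            k = 1 - m + 2 * l
            d += (-1.0 if k % 2 else 1.0) * pm * a[k % n]
        coarse.append(S * c)
        detail.append(S * d)
    return coarse, detail

def reconstruct(coarse, detail, p):
    """One reconstruction step:
    a_k = 2^{-1/2} sum_l (coarse_l p_{k-2l} + detail_l (-1)^k p_{1-k+2l})."""
    n = 2 * len(coarse)
    a = [0.0] * n
    for l, (c, d) in enumerate(zip(coarse, detail)):
        for m, pm in enumerate(p):
            a[(m + 2 * l) % n] += S * c * pm
            k = 1 - m + 2 * l
            a[k % n] += S * d * (-1.0 if k % 2 else 1.0) * pm
    return a

haar = [1.0, 1.0]
s3 = math.sqrt(3.0)
daub4 = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

signal = [1.0, -3.0, 2.0, 4.0, 0.5, 0.0, -1.0, 2.5]
for p in (haar, daub4):
    coarse, detail = decompose(signal, p)
    rebuilt = reconstruct(coarse, detail, p)
    print(max(abs(u - v) for u, v in zip(signal, rebuilt)))
```

For both sequences the reconstruction error is at the level of rounding, and the sum of squares of the coarse and detail coefficients equals that of the input, as expected for an orthogonal change of basis.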

We have assumed that we have already obtained the scaling function
$\phi(t)$ with the property that $\{\phi(t - k)\}$ is an orthogonal basis for $V_0$. But
how do we actually obtain such functions?


16.8 Generating the Scaling Function 


The scaling function $\phi(t)$ is generated from the two-scale sequence $\{p_k\}$
using the following iterative procedure. Start with $\phi_0(t) = \phi_{Haar}(t)$, the
Haar scaling function that is 1 on [0,1] and 0 elsewhere. Now, for each
n = 1,2,..., define


$$\phi_n(t) = \sum_{k=-\infty}^{\infty} p_k\, \phi_{n-1}(2t - k).$$


Provided that the sequence $\{p_k\}$ has certain properties to be discussed
below, this sequence of functions converges and the limit is the desired
scaling function.

The properties of $\{p_k\}$ that are needed can be expressed in terms of
properties of the function

$$P(z) = \frac{1}{2} \sum_{k=-\infty}^{\infty} p_k z^k.$$

For the Haar MRA, this function is $P(z) = \frac{1}{2}(1 + z)$. We require that

1. $P(1) = 1$,

2. $|P(e^{i\theta})|^2 + |P(e^{i(\theta+\pi)})|^2 = 1$, for $0 \le \theta \le \pi$, and

3. $|P(e^{i\theta})| > 0$ for $-\frac{\pi}{2} \le \theta \le \frac{\pi}{2}$.
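
The iterative procedure of this section can be tried directly. The sketch below evaluates the iterates pointwise by recursion; it is meant only as an illustration and is far too slow for serious use. We assume the starting function is the indicator of the half-open interval [0,1), and we use two sequences: the Haar sequence and the four-term sequence derived in the next section. The name `phi_n` is ours.

```python
import math

s3 = math.sqrt(3.0)
HAAR = [1.0, 1.0]
DAUB4 = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

def phi_n(t, p, n):
    """n-th iterate: phi_n(t) = sum_k p_k * phi_{n-1}(2t - k), starting from
    the indicator function of [0, 1) (half-open, an assumption)."""
    if n == 0:
        return 1.0 if 0.0 <= t < 1.0 else 0.0
    # Every iterate vanishes outside [0, len(p) - 1], so prune early.
    if t < 0.0 or t > len(p) - 1:
        return 0.0
    return sum(pk * phi_n(2.0 * t - k, p, n - 1) for k, pk in enumerate(p))

# For the Haar sequence the iteration reproduces the starting function.
haar_inside = phi_n(0.3, HAAR, 5)
haar_outside = phi_n(1.2, HAAR, 5)

# For the four-term sequence the iterates live on [0, 3]; their integer
# translates keep summing to one at every stage of the iteration.
t = 0.3
translate_sum = sum(phi_n(t + m, DAUB4, 6) for m in range(3))
print(haar_inside, haar_outside, translate_sum)
```

The last check reflects the fact that the even-indexed and odd-indexed terms of each of these sequences sum to one, so the property that the translates sum to one is passed from each iterate to the next.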


16.9 Generating the Two-Scale Sequence 


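The final piece of the puzzle is the generation of the sequence $\{p_k\}$ itself,
or, equivalently, finding a function P(z) with the properties listed above.
The following example, also used in [11], illustrates Ingrid Daubechies'
method [63].

We begin with the identity


$$\cos^2\frac{\theta}{2} + \sin^2\frac{\theta}{2} = 1$$


and then raise both sides to an odd power n = 2N - 1. Here we use N = 2,
obtaining


$$1 = \cos^6\frac{\theta}{2} + 3\cos^4\frac{\theta}{2}\sin^2\frac{\theta}{2} + \cos^6\frac{\theta+\pi}{2} + 3\cos^4\frac{\theta+\pi}{2}\sin^2\frac{\theta+\pi}{2}.$$

We then let

$$|P(e^{i\theta})|^2 = \cos^6\frac{\theta}{2} + 3\cos^4\frac{\theta}{2}\sin^2\frac{\theta}{2},$$

so that


$$|P(e^{i\theta})|^2 + |P(e^{i(\theta+\pi)})|^2 = 1$$

for $0 \le \theta \le \pi$. Now we have to find $P(e^{i\theta})$. Writing


$$|P(e^{i\theta})|^2 = \cos^4\frac{\theta}{2}\left(\cos^2\frac{\theta}{2} + 3\sin^2\frac{\theta}{2}\right),$$

we have


$$P(e^{i\theta}) = \cos^2\frac{\theta}{2}\left(\cos\frac{\theta}{2} - \sqrt{3}\, i \sin\frac{\theta}{2}\right) e^{i\alpha(\theta)},$$

where the real function $\alpha(\theta)$ is arbitrary. Selecting $\alpha(\theta) = \frac{3}{2}\theta$, we get


$$P(e^{i\theta}) = \frac{1}{2}\left(p_0 + p_1 e^{i\theta} + p_2 e^{2i\theta} + p_3 e^{3i\theta}\right),$$


for

$$p_0 = \frac{1+\sqrt{3}}{4}, \quad p_1 = \frac{3+\sqrt{3}}{4}, \quad p_2 = \frac{3-\sqrt{3}}{4}, \quad p_3 = \frac{1-\sqrt{3}}{4},$$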

and all the other coefficients are zero. The resulting Daubechies’ wavelet is 
compactly supported and continuous, but not differentiable [11, 63]. Figure 
16.1 shows the scaling function and mother wavelet for N = 2. When larger 
values of N are used, the resulting wavelet, often denoted $\psi_N(t)$, which is
again compactly supported, has approximately N/5 continuous derivatives. 
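
The four coefficients found above can be checked against the requirements on P(z). The sketch below assumes the normalization $P(z) = \frac{1}{2}\sum_k p_k z^k$, chosen so that P(1) = 1 when the $p_k$ sum to 2; the grid of 201 angles and the helper names are our own choices for illustration.

```python
import cmath
import math

s3 = math.sqrt(3.0)
p = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

def P(z):
    """P(z) = (1/2) * sum_k p_k z^k (assumed normalization)."""
    return 0.5 * sum(pk * z ** k for k, pk in enumerate(p))

def on_circle(theta):
    return P(cmath.exp(1j * theta))

def target(theta):
    """cos^6(theta/2) + 3 cos^4(theta/2) sin^2(theta/2)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return c ** 6 + 3 * c ** 4 * s ** 2

thetas = [math.pi * i / 200 for i in range(201)]  # grid on [0, pi]

# Requirement 2: |P(e^{i theta})|^2 + |P(e^{i(theta + pi)})|^2 = 1.
identity_error = max(
    abs(abs(on_circle(t)) ** 2 + abs(on_circle(t + math.pi)) ** 2 - 1.0)
    for t in thetas)

# The squared modulus agrees with the trigonometric expression in the text.
modulus_error = max(abs(abs(on_circle(t)) ** 2 - target(t)) for t in thetas)

# Requirement 3: |P| stays away from zero for theta in [-pi/2, pi/2].
min_modulus = min(abs(on_circle(t - math.pi / 2)) for t in thetas)

def shifted_product(m):
    """sum_k p_{k-2m} p_k over the finite support of the sequence."""
    return sum(p[k - 2 * m] * p[k] for k in range(len(p))
               if 0 <= k - 2 * m < len(p))

print(abs(P(1)), identity_error, modulus_error, min_modulus)
print(shifted_product(0), shifted_product(1))
```

The last line checks the orthogonality property of Ex. 16.4: the shifted products vanish for nonzero shifts and equal 2 for zero shift.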

These notions extend to nonorthogonal wavelet bases and to frames. 
Algorithms similar to the fast Fourier transform provide the wavelet de- 
composition and reconstruction of signals. The recent text by Boggess and 
Narcowich [11] is a nice introduction to this fast-growing area; the more 
advanced book by Chui [55] is also a good source. Wavelets in the context 
of Riesz bases and frames are discussed in Christensen’s book [54]. Appli- 
cations of wavelets to medical imaging are found in [127], as well as in the 
other papers in that special issue. 


16.10 Wavelets and Filter Banks 


In [152] Strang and Nguyen take a somewhat different approach to 
wavelets, emphasizing the role of filters and matrices. To illustrate one of 
their main points, we consider the two-point moving average filter. 

The two-point moving average filter transforms an input sequence x =
{x(n)} to output y = {y(n)}, with $y(n) = \frac{1}{2}x(n) + \frac{1}{2}x(n-1)$. The filter
h = {h(k)} has $h(0) = h(1) = \frac{1}{2}$ and all the remaining h(n) are zero. This
filter is a finite impulse response (FIR) low-pass filter and is not invertible;
the input sequence with $x(n) = (-1)^n$ has output zero. Similarly, the two-point
moving difference filter g = {g(k)}, with $g(0) = \frac{1}{2}$, $g(1) = -\frac{1}{2}$, and
the rest zero, is a FIR high-pass filter, also not invertible. However, if we
apply these filters in parallel, as a filter bank, no information is lost and
the input can be completely reconstructed, with a unit delay. In addition,
the outputs of the two filters contain redundancy that can be removed by
decimation, which is taken here to mean downsampling, that is, throwing
away every other term of a sequence.

FIGURE 16.1: Daubechies' scaling function and mother wavelet for N = 2.
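
The two-filter bank just described is easy to simulate. The sketch below is illustrative only: it assumes a finite input of even length, takes x(-1) = 0 at the left end, and keeps the odd-indexed outputs when downsampling; these conventions and the function names are ours, not taken from [152].

```python
def moving_average(x):
    """Low-pass filter: y(n) = x(n)/2 + x(n-1)/2, with x(-1) taken as 0."""
    return [(x[n] + (x[n - 1] if n > 0 else 0)) / 2 for n in range(len(x))]

def moving_difference(x):
    """High-pass filter: y(n) = x(n)/2 - x(n-1)/2, with x(-1) taken as 0."""
    return [(x[n] - (x[n - 1] if n > 0 else 0)) / 2 for n in range(len(x))]

def downsample(y):
    """Throw away every other term; here we keep the odd-indexed terms."""
    return y[1::2]

def reconstruct(low, high):
    """Recover the input from the two downsampled outputs:
    low(2m+1) - high(2m+1) = x(2m) and low(2m+1) + high(2m+1) = x(2m+1)."""
    x = []
    for lo, hi in zip(low, high):
        x.append(lo - hi)
        x.append(lo + hi)
    return x

x = [1.0, -3.0, 2.0, 4.0, 0.0, 5.0]
low = downsample(moving_average(x))
high = downsample(moving_difference(x))
rebuilt = reconstruct(low, high)
print(rebuilt)

# Neither filter is invertible alone: the alternating input is annihilated
# by the moving average (apart from the start-up term at n = 0).
alternating = [(-1) ** n for n in range(8)]
print(moving_average(alternating))
```

Before downsampling, the sum of the two outputs at time n is x(n) and their difference is x(n-1), which is the source of both the redundancy and the unit delay.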

The authors treat the more general problem of obtaining perfect recon- 
struction of the input from the output of a filter bank of low- and high-pass 
filters followed by downsampling. The properties that must be required of 
the filters are those we encountered earlier with regard to the two-scale se- 
quences for the father and mother wavelets. When the filter operations are 
construed as matrix multiplications, the decomposition and reconstruction 
algorithms become matrix factorizations. 




16.11 Using Wavelets 


We consider the Daubechies mother wavelet $\psi_N(t)$, for N = 1,2,..., and
n = 2N - 1. The two-scale sequence $\{p_k\}$ then has nonzero terms $p_0, ..., p_n$.
For example, when N = 1, we get the Haar wavelet, with $p_0 = p_1 = 1$,
and all the other $p_k = 0$.

The wavelet signal analysis usually begins by sampling the signal f(t)
closely enough so that we can approximate the $a^{J+1}_k$ by the samples
$f(k/2^{J+1})$.

An important aspect of the Daubechies wavelets is the vanishing of
moments. For k = 0,1,...,N - 1 we have


$$\int t^k\, \psi_N(t)\, dt = 0;$$

for the Haar case we have only that $\int \psi_1(t)\, dt = 0$. We consider now the
significance of vanishing moments for detection.

For an arbitrary signal f(t) the wavelet coefficients $b^j_k$ are given by


$$b^j_k = \int f(t)\, 2^{j/2}\, \psi_N(2^j t - k)\, dt.$$

We focus on N = 2.

The function $\psi_2(2^j t - k)$ is supported on the interval $[k/2^j, (k+3)/2^j]$
so we have


$$b^j_k = 2^{j/2} \int_0^{3/2^j} f(t + k/2^j)\, \psi_2(2^j t)\, dt.$$

If f(t) is smooth near $t = k/2^j$, and j is large enough, then

$$f(t + k/2^j) = f(k/2^j) + f'(k/2^j)\, t + \frac{1}{2} f''(k/2^j)\, t^2 + \cdots,$$

and so


$$b^j_k \approx 2^{j/2} \left( f(k/2^j) \int_0^{3/2^j} \psi_2(2^j t)\, dt + f'(k/2^j) \int_0^{3/2^j} t\, \psi_2(2^j t)\, dt + \frac{1}{2} f''(k/2^j) \int_0^{3/2^j} t^2\, \psi_2(2^j t)\, dt \right).$$


Since


$$\int \psi_2(t)\, dt = \int t\, \psi_2(t)\, dt = 0$$

and

$$\int t^2\, \psi_2(t)\, dt = -\frac{1}{8}\sqrt{\frac{3}{2\pi}},$$

we have

$$b^j_k \approx -\frac{1}{16}\sqrt{\frac{3}{2\pi}}\; 2^{-5j/2} f''(k/2^j).$$


On the other hand, if f(t) is not smooth near $t = k/2^j$, we expect the $b^j_k$
to have a larger magnitude.


Example 1 Suppose that f(t) is piecewise linear. Then f''(t) = 0, except
at the places where the lines meet. So we expect the $b^j_k$ to be zero, except
at the nodes.
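
Example 1 can be reproduced with a discrete computation. In the sketch below we assume, as is common practice, that the samples themselves stand in for the finest-level coefficients, and we apply the detail formula of the decomposition step with the four-term Daubechies sequence (N = 2). The test signal, its single node at sample 16, and the helper name `detail` are our own illustrative choices.

```python
import math

s3 = math.sqrt(3.0)
p = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

def detail(a, l):
    """b_l = 2^{-1/2} * sum_k a_k (-1)^k p_{1-k+2l}, where k = 1 - m + 2l
    runs over the four indices 2l-2, ..., 2l+1 as m = 3, ..., 0."""
    total = 0.0
    for m, pm in enumerate(p):
        k = 1 - m + 2 * l
        total += (-1.0 if k % 2 else 1.0) * pm * a[k]
    return total / math.sqrt(2.0)

# Piecewise-linear samples with a single node (corner) at index 16.
samples = [abs(k - 16) / 16 for k in range(32)]

# Use only those l whose four indices 2l-2, ..., 2l+1 lie inside the data.
details = {l: detail(samples, l) for l in range(1, 16)}
nonzero = [l for l, b in details.items() if abs(b) > 1e-12]
print(nonzero)
```

The detail coefficients vanish, up to rounding, wherever the four samples involved lie on a single line; only the coefficient whose window contains the node in its interior is nonzero.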


Example 2 Let f(t) = t(1 - t), for $t \in [0,1]$, and zero elsewhere. We might
begin with the sample values $f(k/2^7)$ and then consider $b^6_k$. Again using
N = 2, we find that $b^6_k$ is approximately proportional to $f''(k/2^6) = -2$,
independent of k, except near the
endpoints t = 0 and t = 1. The discontinuity of f'(t) at the ends will make
the $b^6_k$ there larger.


Example 3 Now let $g(t) = t^2(1 - t)^2$, for $t \in [0,1]$, and zero elsewhere.
The first derivative is continuous at the endpoints t = 0 and t = 1, but the 
second derivative is discontinuous there. Using N = 2, we won’t be able to 
detect this discontinuity, but using N = 3 we will. 


Example 4 Suppose that $f(t) = e^{i\omega t}$. Then we have

$$b^j_k = 2^{-j/2}\, e^{i\omega k/2^j}\, \Psi_N(\omega/2^j),$$


whose magnitude is independent of k, where $\Psi_N$ denotes the Fourier transform of $\psi_N$. If we
plot the magnitudes of these values for various j, the maximum is reached when


$$\omega/2^j = \operatorname{argmax} |\Psi_N|,$$


from which we can find $\omega$.


17.1 Chapter Summary


In most signal- and image-processing applications the measured data 
includes (or may include) a signal component we want and unwanted com- 
ponents called noise. Estimation involves determining the precise nature 
and strength of the signal component; deciding if that strength is zero or 
not is detection. 

Noise often appears as an additive term, which we then try to remove. 
If we knew precisely the noisy part added to each data value we would 
simply subtract it; of course, we never have such information. How then do 
we remove something when we don’t know what it is? Statistics provides a 
way out. 

The basic idea in statistics is to use procedures that perform well on 
average, when applied to a class of problems. The procedures are built 
using properties of that class, usually involving probabilistic notions, and 
are evaluated by examining how they would have performed had they been 
applied to every problem in the class. To use such methods to remove 
additive noise, we need a description of the class of noises we expect to 
encounter, not specific values of the noise component in any one particular
instance. We also need some idea about what signal components look like.
In this chapter we discuss solving this noise removal problem using the best 
linear unbiased estimation (BLUE). We begin with the simplest case and 
then proceed to discuss increasingly complex scenarios. 

An important application of the BLUE is in Kalman filtering. The con- 
nection between the BLUE and Kalman filtering is best understood by 
considering the case of the BLUE with a prior estimate of the signal com- 
ponent, and mastering the various matrix manipulations that are involved 
in this problem. These calculations then carry over, almost unchanged, to
Kalman filtering.

Kalman filtering is usually presented in the context of estimating a 
sequence of vectors evolving in time. Kalman filtering for image processing 
is derived by analogy with the temporal case, with certain parts of the 
image considered to be in the “past” of a fixed pixel. 


17.2 The Simplest Case 


Suppose our data is $z_j = c + v_j$, for j = 1,...,J, where c is an unknown
constant to be estimated and the $v_j$ are additive noise. We assume that
$E(v_j) = 0$, $E(v_j \overline{v_k}) = 0$ for $j \neq k$, and $E(|v_j|^2) = \sigma_j^2$. So, the additive
noises are assumed to have mean zero and to be independent (or at least
uncorrelated). In order to estimate c, we adopt the following rules:


1. The estimate $\hat{c}$ is linear in the data $z = (z_1,...,z_J)^T$; that is, $\hat{c} = k^\dagger z$,
for some vector $k = (k_1,...,k_J)^T$.


2. The estimate is unbiased; $E(\hat{c}) = c$. This means $\sum_{j=1}^{J} k_j = 1$.


3. The estimate is best in the sense that it minimizes the expected error
squared; that is, $E(|\hat{c} - c|^2)$ is minimized.
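
Under these three rules the optimal weights are proportional to the reciprocals of the noise variances; this is the content of Ex. 17.1 below. The following sketch, for real data, compares those weights with a plain average. The variances and the function names are illustrative assumptions of ours.

```python
def blue_weights(sigmas):
    """Weights k_i = sigma_i^{-2} / sum_j sigma_j^{-2}."""
    inv = [s ** -2 for s in sigmas]
    total = sum(inv)
    return [w / total for w in inv]

def blue_estimate(z, sigmas):
    """Weighted combination of the data using the weights above."""
    return sum(k * zj for k, zj in zip(blue_weights(sigmas), z))

def estimator_variance(weights, sigmas):
    """For uncorrelated noise, the variance of sum_j k_j z_j is
    sum_j k_j^2 sigma_j^2."""
    return sum(k * k * s * s for k, s in zip(weights, sigmas))

sigmas = [1.0, 2.0, 4.0]           # noise standard deviations (illustrative)
weights = blue_weights(sigmas)
uniform = [1.0 / 3.0] * 3          # the plain average, also unbiased

var_blue = estimator_variance(weights, sigmas)
var_uniform = estimator_variance(uniform, sigmas)
print(weights)
print(var_blue, var_uniform)
```

Both sets of weights sum to one, so both estimators are unbiased, but the inverse-variance weights give the smaller expected squared error because the noisier measurements are discounted.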


Ex. 17.1 Show that the resulting vector k is


$$k_i = \sigma_i^{-2} \Big/ \sum_{j=1}^{J} \sigma_j^{-2},$$


and the BLUE estimator of c is then

$$\hat{c} = \sum_{i=1}^{J} z_i\, \sigma_i^{-2} \Big/ \sum_{j=1}^{J} \sigma_j^{-2}.$$

Ex. 17.2 Suppose we have data $z_1 = c + v_1$ and $z_2 = c + v_2$ and we want to
estimate the constant c. Assume that $E(v_1) = E(v_2) = 0$ and $E(v_1 v_2) = \rho$,
with $0 < |\rho| < 1$. Find the BLUE estimate of c.


Ex. 17.3 The concentration of a substance in solution decreases exponentially
during an experiment. Noisy measurements of the concentration are
made at times $t_1$ and $t_2$, giving the data


$$z_i = x_0 e^{-t_i} + v_i, \quad i = 1,2,$$


where the $v_i$ have mean zero, and are uncorrelated. Find the BLUE for the
initial concentration $x_0$.


17.3 A More General Case 


Suppose now that our data vector is z = Hx + v. Here, x is an unknown
vector whose value is to be estimated. The random vector v is additive noise
whose mean is E(v) = 0 and whose known covariance matrix $Q = E(vv^\dagger)$
is invertible and not necessarily diagonal. The known matrix H is J by N,
with J > N. We seek an estimate of the vector x, using the following rules:


1. The estimate $\hat{x}$ must have the form $\hat{x} = K^\dagger z$, where the matrix K is
to be determined.


2. The estimate is unbiased; that is, $E(\hat{x}) = x$.


3. The K is determined as the minimizer of the expected squared error;
that is, once again we minimize $E(\|\hat{x} - x\|^2)$.


Ex. 17.4 Show that for the estimator to be unbiased we need $K^\dagger H = I$,
the identity matrix.


Ex. 17.5 Show that

$$E(\|\hat{x} - x\|^2) = \operatorname{trace} K^\dagger Q K.$$

Hints: Write the left side as

$$E(\operatorname{trace}((\hat{x} - x)(\hat{x} - x)^\dagger)).$$


Also use the fact that the trace and expected-value operations commute.

The problem then is to minimize trace $K^\dagger Q K$ subject to the constraint
equation $K^\dagger H = I$. We solve this problem using a technique known as
prewhitening.

Since the noise covariance matrix Q is Hermitian and nonnegative definite,
we have $Q = UDU^\dagger$, where the columns of U are the (mutually
orthogonal) eigenvectors of Q and D is a diagonal matrix whose diagonal
entries are the (necessarily nonnegative) eigenvalues of Q; therefore,
$U^\dagger U = I$. We call $C = UD^{1/2}U^\dagger$ the Hermitian square root of Q, since
$C^\dagger = C$ and $C^2 = Q$. We assume that Q is invertible, so that C is also.
Given the system of equations


$$z = Hx + v,$$

as before, we obtain a new system

$$y = Gx + w$$


by multiplying both sides by $C^{-1} = Q^{-1/2}$; here, $G = C^{-1}H$ and
$w = C^{-1}v$. The new noise correlation matrix is


$$E(ww^\dagger) = C^{-1} Q C^{-1} = I,$$


so the new noise is white. For this reason the step of multiplying by $C^{-1}$
is called prewhitening.

With J = CK and $M = C^{-1}H$, we have


$$K^\dagger Q K = J^\dagger J$$


and

$$K^\dagger H = J^\dagger M.$$


Our problem then is to minimize trace $J^\dagger J$, subject to $J^\dagger M = I$. Recall
that the trace of the matrix $A^\dagger A$ is simply the square of the 2-norm of the
vectorization of A.

Our solution method is to transform the original problem into a simpler
problem, where the answer is obvious.

First, for any given matrices L and M such that J and ML have the
same dimensions, the minimum value of


$$f(J) = \operatorname{trace}[(J^\dagger - L^\dagger M^\dagger)(J - ML)]$$


is zero and occurs when J = ML.

Now let $L = L^\dagger = (M^\dagger M)^{-1}$. The solution is again J = ML, but now
this choice for J has the additional property that $J^\dagger M = I$. So, minimizing
f(J) is equivalent to minimizing f(J) subject to the constraint $J^\dagger M = I$
and both problems have the solution J = ML. 
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Now using J'M = I, we expand f(J) to get 


f(J) trace[J' J — Jİ ML — Lİ MÝ J + LİM’ ML] 


= trace[J'J — L- Lt + L' MML]. 


The only term here that involves the unknown matrix J is the first one. 
Therefore, minimizing f (J) subject to J’ M = T is equivalent to minimizing 
trace J' J subject to JIM = I, which is our original problem. Therefore, 
the optimal choice for J is J = ML. Consequently, the optimal choice for 
K is 
K = QHL = Q H(HŻQ'H)', 
and the BLUE estimate of x is 
xgLur = Ê = Ktz = (HQH) tH’ Q tz. 


The simplest case can be obtained from this more general formula by taking
$N = 1$, $H = (1,1,\ldots,1)^T$ and $x = c$.

Note that if the noise is white, that is, $Q = \sigma^2 I$, then $\hat{x} = (H^\dagger H)^{-1}H^\dagger z$,
which is the least-squares solution of the equation $z = Hx$. The effect of
requiring that the estimate be unbiased is that, in this case, we simply
ignore the presence of the noise and calculate the least-squares solution of
the noise-free equation $z = Hx$.
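
The BLUE formula and its white-noise special case can be checked numerically. The sketch below is one possible implementation, assuming NumPy; the function name `blue_estimate` and the random test problem are illustrative assumptions. It uses linear solves in place of explicit inverses.

```python
import numpy as np

def blue_estimate(H, Q, z):
    """BLUE (H^† Q^{-1} H)^{-1} H^† Q^{-1} z for Hermitian, invertible Q."""
    Qinv_H = np.linalg.solve(Q, H)             # Q^{-1} H
    HtQinv = Qinv_H.conj().T                   # H^† Q^{-1}, since Q is Hermitian
    return np.linalg.solve(HtQinv @ H, HtQinv @ z)

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 2))
x_true = np.array([1.0, -2.0])
z = H @ x_true + 0.1 * rng.standard_normal(6)

x_white = blue_estimate(H, 0.25 * np.eye(6), z)   # white noise, Q = sigma^2 I
x_ls = np.linalg.lstsq(H, z, rcond=None)[0]       # ordinary least squares
```

With white noise the two estimates `x_white` and `x_ls` agree, as the text asserts.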

The BLUE estimator involves nested inversion, making it difficult to 
calculate, especially for large matrices. In the exercise that follows, we 
discover an approximation of the BLUE that is easier to calculate. 


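Ex. 17.6 Show that for $\epsilon > 0$ we have

$$(H^\dagger Q^{-1}H + \epsilon I)^{-1}H^\dagger Q^{-1} = H^\dagger(HH^\dagger + \epsilon Q)^{-1}.$$ (17.1)

Hint: Use the identity

$$H^\dagger Q^{-1}(HH^\dagger + \epsilon Q) = (H^\dagger Q^{-1}H + \epsilon I)H^\dagger.$$

It follows from Equation (17.1) that

$$x_{BLUE} = \lim_{\epsilon \to 0} H^\dagger(HH^\dagger + \epsilon Q)^{-1}z.$$

Therefore, we can get an approximation of the BLUE estimate by selecting
$\epsilon > 0$ near zero, solving the system of linear equations

$$(HH^\dagger + \epsilon Q)a = z$$

for $a$ and taking $x = H^\dagger a$.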
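
The approximation just described can be sketched as follows, assuming NumPy. The small matrices `H`, `Q` and the vector `z` are arbitrary illustrative values, and `approx_blue` is a hypothetical helper name. The error against the exact BLUE shrinks as the regularization parameter decreases.

```python
import numpy as np

# Small illustrative problem (values chosen only for the demonstration).
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
Q = np.diag([1.0, 2.0, 3.0, 4.0]) + 0.5 * np.ones((4, 4))
z = np.array([1.0, 2.0, 3.0, 4.0])

# Exact BLUE, (H^† Q^{-1} H)^{-1} H^† Q^{-1} z, for comparison.
HtQinv = np.linalg.solve(Q, H).T
x_blue = np.linalg.solve(HtQinv @ H, HtQinv @ z)

def approx_blue(eps):
    """Solve (H H^† + eps Q) a = z and return H^† a."""
    a = np.linalg.solve(H @ H.T + eps * Q, z)
    return H.T @ a

errors = [np.linalg.norm(approx_blue(eps) - x_blue) for eps in (1e-2, 1e-4, 1e-6)]
```

Only one linear system is solved per value of the parameter, which is the computational advantage over the nested inversion.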
17.4 Some Useful Matrix Identities


In the exercise that follows we consider several matrix identities that 
are useful in developing the Kalman filter. 


Ex. 17.7 Establish the following identities, assuming that all the products
and inverses involved are defined:

$$CDA^{-1}B(C^{-1} - DA^{-1}B)^{-1} = (C^{-1} - DA^{-1}B)^{-1} - C;$$ (17.2)

$$(A - BCD)^{-1} = A^{-1} + A^{-1}B(C^{-1} - DA^{-1}B)^{-1}DA^{-1};$$ (17.3)

$$A^{-1}B(C^{-1} - DA^{-1}B)^{-1} = (A - BCD)^{-1}BC;$$ (17.4)

$$(A - BCD)^{-1} = (I + GD)A^{-1},$$ (17.5)

for

$$G = A^{-1}B(C^{-1} - DA^{-1}B)^{-1}.$$

Hints: To get Equation (17.2) use

$$C(C^{-1} - DA^{-1}B) = I - CDA^{-1}B.$$

For the second identity, multiply both sides of Equation (17.3) on the left
by $A - BCD$ and at the appropriate step use Equation (17.2). For Equation
(17.4) show that

$$BC(C^{-1} - DA^{-1}B) = B - BCDA^{-1}B = (A - BCD)A^{-1}B.$$

For Equation (17.5), substitute what $G$ is and use Equation (17.3).
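
A numerical spot check is no substitute for the proofs requested in the exercise, but it is a quick way to catch a misread sign or a misplaced inverse. The sketch below assumes NumPy and uses randomly generated matrices with strong diagonals so that all the inverses exist; the sizes and scalings are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
# Strong diagonals keep every inverse below well defined.
A = 5.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
C = 2.0 * np.eye(m) + 0.1 * rng.standard_normal((m, m))
B = 0.1 * rng.standard_normal((n, m))
D = 0.1 * rng.standard_normal((m, n))

inv = np.linalg.inv
A_inv, C_inv = inv(A), inv(C)
W = inv(C_inv - D @ A_inv @ B)                 # (C^{-1} - D A^{-1} B)^{-1}
G = A_inv @ B @ W

checks = {
    "17.2": np.allclose(C @ D @ A_inv @ B @ W, W - C),
    "17.3": np.allclose(inv(A - B @ C @ D), A_inv + A_inv @ B @ W @ D @ A_inv),
    "17.4": np.allclose(A_inv @ B @ W, inv(A - B @ C @ D) @ B @ C),
    "17.5": np.allclose(inv(A - B @ C @ D), (np.eye(n) + G @ D) @ A_inv),
}
```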


17.5 The BLUE with a Prior Estimate 


In Kalman filtering we want to estimate an unknown vector x given 
measurements z = Hx + v, but also given a prior estimate y of x. It is the 
case there that E(y) = E(x), so we write y = x + w, with w independent 
of both x and v and E(w) = 0. The covariance matrix for w we denote by 
$E(ww^\dagger) = R$. We now require that the estimate $\hat{x}$ be linear in both $z$ and
$y$; that is, the estimate has the form

$$\hat{x} = C^\dagger z + D^\dagger y,$$

for matrices $C$ and $D$ to be determined.
Our approach is to apply the BLUE to the combined system of linear
equations

$$z = Hx + v$$

and

$$y = x + w.$$


In matrix language this combined system becomes $u = Jx + n$, with $u^T = [z^T \; y^T]$,
$J^T = [H^T \; I^T]$, and $n^T = [v^T \; w^T]$. The noise covariance matrix becomes

$$P = \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix}.$$


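The BLUE estimate is $K^\dagger u$, with $K^\dagger J = I$. Minimizing the variance, we
find that the optimal $K^\dagger$ is

$$K^\dagger = (J^\dagger P^{-1}J)^{-1}J^\dagger P^{-1}.$$

The optimal estimate is then

$$\hat{x} = (H^\dagger Q^{-1}H + R^{-1})^{-1}(H^\dagger Q^{-1}z + R^{-1}y).$$

Therefore,

$$C^\dagger = (H^\dagger Q^{-1}H + R^{-1})^{-1}H^\dagger Q^{-1}$$

and

$$D^\dagger = (H^\dagger Q^{-1}H + R^{-1})^{-1}R^{-1}.$$

Using the matrix identities in Equations (17.3) and (17.4) we can rewrite
this estimate in the more useful form

$$\hat{x} = y + G(z - Hy),$$

for

$$G = RH^\dagger(Q + HRH^\dagger)^{-1}.$$ (17.6)

The covariance matrix of the optimal estimator is $K^\dagger PK$, which can be
written as

$$K^\dagger PK = (R^{-1} + H^\dagger Q^{-1}H)^{-1} = (I - GH)R.$$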


In the context of the Kalman filter, R is the covariance of the prior estimate 
of the current state, G is the Kalman gain matrix, and K'PK is the pos- 
terior covariance of the current state. The algorithm proceeds recursively 
from one state to the next in time. 
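
The equivalence between the gain form of the estimate and the form obtained directly from the combined system can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `blue_with_prior` and all numerical values are illustrative assumptions.

```python
import numpy as np

def blue_with_prior(H, Q, R, z, y):
    """Gain form of the BLUE with a prior estimate y of x."""
    G = R @ H.conj().T @ np.linalg.inv(Q + H @ R @ H.conj().T)   # gain, Equation (17.6)
    x_hat = y + G @ (z - H @ y)                                   # prior plus correction
    P_post = (np.eye(R.shape[0]) - G @ H) @ R                     # (I - G H) R
    return x_hat, G, P_post

# Illustrative numbers only.
H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Q = np.diag([1.0, 2.0, 3.0])
R = np.array([[2.0, 0.5], [0.5, 1.0]])
z = np.array([1.0, 0.5, -1.0])
y = np.array([0.8, -0.3])

x_hat, G, P_post = blue_with_prior(H, Q, R, z, y)

# Direct form, (H^† Q^{-1} H + R^{-1})^{-1} (H^† Q^{-1} z + R^{-1} y), for comparison.
info = H.T @ np.linalg.inv(Q) @ H + np.linalg.inv(R)
x_info = np.linalg.solve(info, H.T @ np.linalg.inv(Q) @ z + np.linalg.inv(R) @ y)
```

The gain form inverts a matrix whose size is the number of measurements, while the direct form inverts matrices whose size is the number of unknowns; which is cheaper depends on the problem.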
17.6 Adaptive BLUE


We have assumed so far that we know the covariance matrix Q corre- 
sponding to the measurement noise. If we do not, then we may attempt 
to estimate Q from the measurements themselves; such methods are called 
noise-adaptive. To illustrate, let the innovations vector be $e = z - Hy$.
Then the covariance matrix of $e$ is $S = HRH^\dagger + Q$. Having obtained an
estimate $\hat{S}$ of $S$ from the data, we use $\hat{S} - HRH^\dagger$ in place of $Q$ in Equation
(17.6).


17.7 The Kalman Filter 


So far in this chapter we have focused on the filtering problem: Given 
the data vector z, estimate x, assuming that z consists of noisy measure- 
ments of Hx; that is, z = Hx+v. An important extension of this problem is 
that of stochastic prediction. Shortly, we discuss the Kalman-filter method 
for solving this more general problem. One area in which prediction plays 
an important role is the tracking of moving targets, such as ballistic mis- 
siles, using radar. The range to the target, its angle of elevation, and its 
azimuthal angle are all functions of time governed by linear differential 
equations. The state vector of the system at time t might then be a vector 
with nine components, the three functions just mentioned, along with their 
first and second derivatives. In theory, if we knew the initial state perfectly 
and our differential equations model of the physics was perfect, that would 
be enough to determine the future states. In practice neither of these is 
true, and we need to assist the differential equation by taking radar mea- 
surements of the state at various times. The problem then is to estimate 
the state at time t using both the measurements taken prior to time t and 
the estimate based on the physics. 

When such tracking is performed digitally, the functions of time are replaced
by discrete sequences. Let the state vector at time $k\Delta t$ be denoted
by $x_k$, for $k$ an integer and $\Delta t > 0$. Then, with the derivatives in the
differential equation approximated by divided differences, the physical model
for the evolution of the system in time becomes

$$x_k = A_{k-1}x_{k-1} + m_{k-1}.$$

The matrix $A_{k-1}$, which we assume is known, is obtained from the differential
equation, which may have nonconstant coefficients, as well as from the
divided difference approximations to the derivatives. The random vector
sequence $m_{k-1}$ represents the error in the physical model due to the
discretization and necessary simplification inherent in the original differential
equation itself. We assume that the expected value of $m_k$ is zero for each
$k$. The covariance matrix is $E(m_k m_k^\dagger) = M_k$.

At time $k\Delta t$ we have the measurements

$$z_k = H_k x_k + v_k,$$

where $H_k$ is a known matrix describing the nature of the linear measurements
of the state vector and the random vector $v_k$ is the noise in these
measurements. We assume that the mean value of $v_k$ is zero for each $k$.
The covariance matrix is $E(v_k v_k^\dagger) = Q_k$. We assume that the initial state
vector $x_0$ is arbitrary.

Given an unbiased estimate $\hat{x}_{k-1}$ of the state vector $x_{k-1}$, our prior
estimate of $x_k$ based solely on the physics is

$$y_k = A_{k-1}\hat{x}_{k-1}.$$


Ex. 17.8 Show that $E(y_k - x_k) = 0$, so the prior estimate of $x_k$ is unbiased.
We can then write $y_k = x_k + w_k$, with $E(w_k) = 0$.


17.8 Kalman Filtering and the BLUE 


The Kalman filter [98, 81, 56] is a recursive algorithm to estimate the
state vector $x_k$ at time $k\Delta t$ as a linear combination of the vectors $z_k$ and
$y_k$. The estimate $\hat{x}_k$ will have the form

$$\hat{x}_k = C_k^\dagger z_k + D_k^\dagger y_k,$$

for matrices $C_k$ and $D_k$ to be determined. As we shall see, this estimate
can also be written as

$$\hat{x}_k = y_k + G_k(z_k - H_k y_k),$$

which shows that the estimate involves a prior prediction step, the $y_k$,
followed by a correction step, in which $H_k y_k$ is compared to the measured
data vector $z_k$; such estimation methods are sometimes called
predictor-corrector methods.

In our discussion of the BLUE, we saw how to incorporate a prior
estimate of the vector to be estimated. The trick was to form a larger
matrix equation and then to apply the BLUE to that system. The Kalman
filter does just that.
The correction step in the Kalman filter uses the BLUE to solve the
combined linear system

$$z_k = H_k x_k + v_k$$

and

$$y_k = x_k + w_k.$$

The covariance matrix of $\hat{x}_{k-1} - x_{k-1}$ is denoted by $P_{k-1}$, and, as before,
$Q_k = E(v_k v_k^\dagger)$ is the covariance matrix of the measurement noise. The
covariance matrix of $y_k - x_k$ is

$$\mathrm{cov}(y_k - x_k) = R_k = M_{k-1} + A_{k-1}P_{k-1}A_{k-1}^\dagger.$$

It follows from our earlier discussion of the BLUE that the estimate of $x_k$
is

$$\hat{x}_k = y_k + G_k(z_k - H_k y_k),$$

with

$$G_k = R_k H_k^\dagger(Q_k + H_k R_k H_k^\dagger)^{-1}.$$

Then, the covariance matrix of $\hat{x}_k - x_k$ is

$$P_k = (I - G_k H_k)R_k.$$

The recursive procedure is to go from $P_{k-1}$ and $M_{k-1}$ to $R_k$, then to $G_k$,
from which $\hat{x}_k$ is formed, and finally to $P_k$, which, along with the known
matrix $M_k$, provides the input to the next step. The time-consuming part
of this recursive algorithm is the matrix inversion in the calculation of $G_k$.
Simpler versions of the algorithm are based on the assumption that the
matrices $Q_k$ are diagonal, or on the convergence of the matrices $G_k$ to a
limiting matrix $G$ [56].
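
The recursion just described can be sketched in a few lines. This is a minimal illustration, assuming NumPy, and not a production filter; the function name `kalman_step`, the position-velocity model, and all numerical values are assumptions made for the example.

```python
import numpy as np

def kalman_step(x_prev, P_prev, A_prev, M_prev, H, Q, z):
    """One predictor-corrector step: (x_{k-1}, P_{k-1}) -> (x_k, P_k)."""
    y = A_prev @ x_prev                                # prior estimate y_k
    R = M_prev + A_prev @ P_prev @ A_prev.conj().T     # cov(y_k - x_k)
    S = Q + H @ R @ H.conj().T                         # innovations covariance
    G = np.linalg.solve(S, H @ R).conj().T             # R H^† S^{-1}; R and S are Hermitian
    x_new = y + G @ (z - H @ y)                        # correction step
    P_new = (np.eye(len(y)) - G @ H) @ R               # posterior covariance
    return x_new, P_new

dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])    # position-velocity model (assumption)
H = np.array([[1.0, 0.0]])               # only the position is measured
M = 0.01 * np.eye(2)                     # model-error covariance
Q = np.array([[0.5]])                    # measurement-noise covariance

x_est, P_est = np.zeros(2), 10.0 * np.eye(2)
for z_k in ([1.1], [1.9], [3.2], [3.9]):
    x_est, P_est = kalman_step(x_est, P_est, A, M, H, Q, np.array(z_k))
```

The gain is obtained from a linear solve with the innovations covariance rather than from an explicit inverse, which is the usual way to handle the costly step mentioned above.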

There are many variants of the Kalman filter, corresponding to varia- 
tions in the physical model, as well as in the statistical assumptions. The 
differential equation may be nonlinear, so that the matrices $A_k$ depend on
$x_k$. The system noise sequence $\{w_k\}$ and the measurement noise sequence
$\{v_k\}$ may be correlated. For computational convenience the various functions
that describe the state may be treated separately. The model may
include known external inputs to drive the differential system, as in the 
tracking of spacecraft capable of firing booster rockets. Finally, the noise 
covariance matrices may not be known a priori and adaptive filtering may 
be needed. We discuss this last issue briefly in the next section. 
17.9 Adaptive Kalman Filtering


As in [56] we consider only the case in which the covariance matrix
$Q_k$ of the measurement noise $v_k$ is unknown. As we saw in the discussion
of adaptive BLUE, the covariance matrix of the innovations vector
$e_k = z_k - H_k y_k$ is

$$S_k = H_k R_k H_k^\dagger + Q_k.$$

Once we have an estimate $\hat{S}_k$ for $S_k$, we estimate $Q_k$ using

$$\hat{Q}_k = \hat{S}_k - H_k R_k H_k^\dagger.$$

We might assume that $S_k$ is independent of $k$ and estimate $S_k = S$ using
past and present innovations; for example, we could use

$$\hat{S} = \frac{1}{k-1}\sum_{j=1}^{k}(z_j - H_j y_j)(z_j - H_j y_j)^\dagger.$$
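
A sample-covariance estimate of this kind might be computed as in the sketch below, which assumes NumPy. The function name `estimate_measurement_cov` is hypothetical, and the normalization by $k - 1$ is one common choice adopted here as an assumption. Note that the resulting estimate of the measurement-noise covariance is not guaranteed to be nonnegative definite when few innovations are available.

```python
import numpy as np

def estimate_measurement_cov(innovations, H, R):
    """Return S_hat - H R H^†, with S_hat the sample covariance of the innovations."""
    E = np.asarray(innovations)                # shape (k, J), one innovation per row
    k = E.shape[0]
    S_hat = (E.T @ E.conj()) / (k - 1)         # sum_j e_j e_j^† / (k - 1)
    return S_hat - H @ R @ H.conj().T

# Tiny illustrative data set: four innovations in two dimensions.
E_demo = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
Q_hat = estimate_measurement_cov(E_demo, np.eye(2), 0.1 * np.eye(2))
```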


17.10 Difficulties with the BLUE 


As we just saw, the best linear unbiased estimate of $x$, given the observed
vector $z = Hx + v$, is

$$x_{BLUE} = (H^\dagger Q^{-1}H)^{-1}H^\dagger Q^{-1}z,$$ (17.7)

where $Q$ is the invertible covariance matrix of the mean zero noise vector
$v$ and $H$ is a $J$ by $N$ matrix with $J > N$ and $H^\dagger H$ invertible. Even if we
know $Q$ exactly, the double inversion in Equation (17.7) makes it difficult
to calculate the BLUE estimate, especially for large vectors $z$. It is often
the case in practice that we do not know $Q$ precisely and must estimate or
model it. Because good approximations of $Q$ do not necessarily lead to good
approximations of $Q^{-1}$, the calculation of the BLUE is further complicated.
For these reasons one may decide to use the least-squares estimate

$$x_{LS} = (H^\dagger H)^{-1}H^\dagger z$$

instead. We are therefore led to consider when the two estimation methods
produce the same answers; that is, when we have

$$(H^\dagger H)^{-1}H^\dagger = (H^\dagger Q^{-1}H)^{-1}H^\dagger Q^{-1}.$$ (17.8)

We turn now to a theorem that answers this question. The proof of this
theorem relies on the results of several exercises, useful in themselves, that
involve basic facts of linear algebra.
17.11 Preliminaries from Linear Algebra


We begin with some definitions. Let $S$ be a subspace of finite-dimensional
Euclidean space $C^J$ and $Q$ a $J$ by $J$ Hermitian matrix. We
denote by $Q(S)$ the set

$$Q(S) = \{t \mid \text{there exists } s \in S \text{ with } t = Qs\}$$

and by $Q^{-1}(S)$ the set

$$Q^{-1}(S) = \{u \mid Qu \in S\}.$$

Note that the set $Q^{-1}(S)$ is defined whether or not $Q$ is invertible.
We denote by $S^\perp$ the set of vectors $u$ that are orthogonal to every
member of $S$; that is,

$$S^\perp = \{u \mid u^\dagger s = 0, \text{ for every } s \in S\}.$$

Let $H$ be a $J$ by $N$ matrix. Then $CS(H)$, the column space of $H$, is the
subspace of $C^J$ consisting of all the linear combinations of the columns of
$H$. The null space of $H^\dagger$, denoted $NS(H^\dagger)$, is the subspace of $C^J$ containing
all the vectors $w$ for which $H^\dagger w = 0$.


Ex. 17.9 Show that $CS(H)^\perp = NS(H^\dagger)$. Hint: If $v \in CS(H)^\perp$, then
$v^\dagger Hx = 0$ for all $x$, including $x = H^\dagger v$.


Ex. 17.10 Show that $CS(H) \cap NS(H^\dagger) = \{0\}$. Hint: If $y = Hx \in NS(H^\dagger)$
consider $\|y\|^2 = y^\dagger y$.


Ex. 17.11 Let $S$ be any subspace of $C^J$. Show that if $Q$ is invertible and
$Q(S) = S$ then $Q^{-1}(S) = S$. Hint: If $Qt = Qs$ then $t = s$.


Ex. 17.12 Let $Q$ be Hermitian. Show that $Q(S)^\perp = Q^{-1}(S^\perp)$ for every
subspace $S$. If $Q$ is also invertible then $Q^{-1}(S)^\perp = Q(S^\perp)$. Find an example
of a non-invertible $Q$ for which $Q^{-1}(S)^\perp$ and $Q(S^\perp)$ are different.


We assume that $Q$ is Hermitian and invertible and that the matrix $H^\dagger H$
is invertible. Note that the matrix $H^\dagger Q^{-1}H$ need not be invertible under
these assumptions. We shall denote by $S$ an arbitrary subspace of $C^J$.


Ex. 17.13 Show that $Q(S) = S$ if and only if $Q(S^\perp) = S^\perp$. Hint: Use
Exercise 17.12.


Ex. 17.14 Show that if $Q(CS(H)) = CS(H)$ then $H^\dagger Q^{-1}H$ is invertible.
Hint: Show that $H^\dagger Q^{-1}Hx = 0$ if and only if $x = 0$. Recall that $Q^{-1}Hx \in CS(H)$,
by Exercise 17.12. Then use Exercise 17.10.
17.12 When Are the BLUE and the LS Estimator the Same?


We are looking for conditions on $Q$ and $H$ that imply Equation (17.8),
which we rewrite as

$$H^\dagger = (H^\dagger Q^{-1}H)(H^\dagger H)^{-1}H^\dagger Q$$ (17.9)

or

$$H^\dagger Tx = 0$$

for all $x$, where

$$T = I - Q^{-1}H(H^\dagger H)^{-1}H^\dagger Q.$$

In other words, we want $Tx \in NS(H^\dagger)$ for all $x$. The theorem is the
following:


Theorem 17.1 We have $Tx \in NS(H^\dagger)$ for all $x$ if and only if we have
$Q(CS(H)) = CS(H)$.


An equivalent form of this theorem was proven by Anderson in [1]; he 
attributes a portion of the proof to Magness and McQuire [114]. The proof 
we give here is due to Kheifets [100] and is much simpler than Anderson’s 
proof. The proof of the theorem is simplified somewhat by first establishing 
the result in the next exercise. 


Ex. 17.15 Show that if Equation (17.9) holds, then the matrix $H^\dagger Q^{-1}H$
is invertible. Hint: Recall that we have assumed that $CS(H^\dagger) = C^N$ when
we assumed that $H^\dagger H$ is invertible. From Equation (17.9) it follows that
$CS(H^\dagger Q^{-1}H) = C^N$.


A Proof of Theorem 17.1: Assume first that $Q(CS(H)) = CS(H)$,
which, as we now know, also implies $Q(NS(H^\dagger)) = NS(H^\dagger)$, as well as
$Q^{-1}(CS(H)) = CS(H)$, $Q^{-1}(NS(H^\dagger)) = NS(H^\dagger)$, and the invertibility
of the matrix $H^\dagger Q^{-1}H$. Every $x \in C^J$ has the form $x = Ha + w$, for some
$a$ and $w \in NS(H^\dagger)$. We show that $Tx = w$, so that $Tx \in NS(H^\dagger)$ for all
$x$. We have

$$Tx = THa + Tw = x - Q^{-1}H(H^\dagger H)^{-1}H^\dagger QHa - Q^{-1}H(H^\dagger H)^{-1}H^\dagger Qw.$$

We know that $QHa = Hb$ for some $b$, so that $Ha = Q^{-1}Hb$. We also
know that $Qw = v \in NS(H^\dagger)$, so that $w = Q^{-1}v$. Then, continuing our
calculations, we have

$$Tx = x - Q^{-1}Hb - 0 = x - Ha = w,$$

so $Tx \in NS(H^\dagger)$.




Conversely, suppose now that $Tx \in NS(H^\dagger)$ for all $x$, which, as we
have seen, is equivalent to Equation (17.9). We show that $Q^{-1}(NS(H^\dagger)) = NS(H^\dagger)$.
First, let $v \in Q^{-1}(NS(H^\dagger))$; we show $v \in NS(H^\dagger)$. We have

$$H^\dagger v = (H^\dagger Q^{-1}H)(H^\dagger H)^{-1}H^\dagger Qv,$$

which is zero, since $H^\dagger Qv = 0$. So, we have shown that $Q^{-1}(NS(H^\dagger)) \subseteq NS(H^\dagger)$.
To complete the proof, we take an arbitrary member $v$ of $NS(H^\dagger)$
and show that $v$ is in $Q^{-1}(NS(H^\dagger))$; that is, $Qv \in NS(H^\dagger)$. We know that
$Qv = Ha + w$, for $w \in NS(H^\dagger)$, and

$$a = (H^\dagger H)^{-1}H^\dagger Qv,$$

so that

$$Ha = H(H^\dagger H)^{-1}H^\dagger Qv.$$

Then, using Exercise 17.15, we have

$$Qv = H(H^\dagger H)^{-1}H^\dagger Qv + w$$

$$= H(H^\dagger Q^{-1}H)^{-1}H^\dagger Q^{-1}Qv + w$$

$$= H(H^\dagger Q^{-1}H)^{-1}H^\dagger v + w = w.$$

So $Qv = w$, which is in $NS(H^\dagger)$. This completes the proof.
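
Theorem 17.1 can be illustrated numerically. The sketch below, assuming NumPy, builds one covariance matrix that maps the column space of $H$ onto itself and one that does not, and compares the BLUE and least-squares matrices in each case. The helper names `blue_matrix` and `ls_matrix`, and the particular matrices, are illustrative assumptions.

```python
import numpy as np

def blue_matrix(H, Q):
    """(H^† Q^{-1} H)^{-1} H^† Q^{-1}."""
    HtQinv = np.linalg.solve(Q, H).conj().T
    return np.linalg.solve(HtQinv @ H, HtQinv)

def ls_matrix(H):
    """(H^† H)^{-1} H^†."""
    return np.linalg.solve(H.conj().T @ H, H.conj().T)

H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
P = H @ ls_matrix(H)                               # orthogonal projector onto CS(H)
Q_invariant = 2.0 * P + 5.0 * (np.eye(3) - P)      # Q(CS(H)) = CS(H) by construction
Q_generic = np.diag([1.0, 2.0, 3.0])               # does not map CS(H) into itself

same = np.allclose(blue_matrix(H, Q_invariant), ls_matrix(H))
different = not np.allclose(blue_matrix(H, Q_generic), ls_matrix(H))
```

Here `Q_invariant` acts as a multiple of the identity on the column space of $H$ and as a different multiple on its orthogonal complement, which is the simplest way to satisfy the condition of the theorem without the noise being white.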


17.13 A Recursive Approach 


In array processing and elsewhere, it sometimes happens that the matrix
$Q$ is estimated from several measurements $\{v^n, n = 1,\ldots,N\}$ of the noise
vector $v$ as

$$Q = \frac{1}{N}\sum_{n=1}^{N} v^n (v^n)^\dagger.$$

Then, the inverses of $Q$ and of $H^\dagger Q^{-1}H$ can be obtained recursively, using
the Sherman-Morrison-Woodbury matrix-inversion identity.


Ex. 17.16 The Sherman-Morrison-Woodbury Identity Let $B$ be an
invertible matrix. Show that

$$(B - uv^\dagger)^{-1} = B^{-1} + \alpha^{-1}(B^{-1}u)(v^\dagger B^{-1}),$$ (17.10)

whenever

$$\alpha = 1 - v^\dagger B^{-1}u \neq 0.$$

Show that, if $\alpha = 0$, then the matrix $B - uv^\dagger$ has no inverse.
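In our application the rank-one term is added rather than subtracted, so the
scalar $\alpha$ of Equation (17.10) becomes $\alpha = 1 + (v^n)^\dagger Q_{n-1}^{-1}v^n$.
Since the matrices involved here are nonnegative definite, this denominator
will always be at least one. The idea is to define $Q_0 = \epsilon I$, for some $\epsilon > 0$,
and, for $n = 1,\ldots,N$,

$$Q_n = Q_{n-1} + v^n (v^n)^\dagger.$$

Then, $Q_n^{-1}$ can be obtained from $Q_{n-1}^{-1}$, and $(H^\dagger Q_n^{-1}H)^{-1}$ from
$(H^\dagger Q_{n-1}^{-1}H)^{-1}$, using the identity in Equation (17.10).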
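
The recursive update of the inverse can be sketched as follows, assuming NumPy. The function name `rank_one_inverse_update`, the starting value of the regularization parameter, and the random noise samples are illustrative assumptions; the update applies Equation (17.10) with the rank-one term added rather than subtracted.

```python
import numpy as np

def rank_one_inverse_update(B_inv, v):
    """Inverse of B + v v^†, given the Hermitian matrix B^{-1}."""
    Bv = B_inv @ v
    alpha = 1.0 + np.real(v.conj() @ Bv)           # at least one when B is positive definite
    return B_inv - np.outer(Bv, Bv.conj()) / alpha

rng = np.random.default_rng(4)
J, N, eps = 3, 5, 1e-2
samples = rng.standard_normal((N, J))              # rows play the role of the noise vectors

Q_inv = np.eye(J) / eps                            # inverse of the starting matrix eps * I
for v in samples:
    Q_inv = rank_one_inverse_update(Q_inv, v)

Q_direct = eps * np.eye(J) + samples.T @ samples   # the same matrix formed explicitly
```

Each update costs on the order of $J^2$ operations, compared with roughly $J^3$ for inverting the accumulated matrix from scratch.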
18.1 Chapter Summary


In this chapter we consider the problem of deciding whether or not 
a particular signal is present in the measured data; this is the detection 
problem. The underlying framework for the detection problem is optimal 


estimation and statistical hypothesis testing [81]. 


18.2 The Model of Signal in Additive Noise 


The basic model used in detection is that of a signal in additive noise.
The complex data vector is $x = (x_1, x_2, \ldots, x_N)^T$. We assume that there
are two possibilities:

Case 1: Noise only

$$x_n = z_n, \quad n = 1,\ldots,N,$$

or
Case 2: Signal in noise

$$x_n = \gamma s_n + z_n,$$

where $z = (z_1, z_2, \ldots, z_N)^T$ is a complex vector whose entries $z_n$ are values
of random variables that we call noise, about which we have only statistical
information (that is to say, information about the average behavior),
$s = (s_1, s_2, \ldots, s_N)^T$ is a complex signal vector that we may know exactly, or at
least for which we have a specific parametric model, and $\gamma$ is a scalar that
may be viewed either as deterministic or random (but unknown, in either
case). Unless otherwise stated, we shall assume that $\gamma$ is deterministic.

The detection problem is to decide which case we are in, based on some
calculation performed on the data $x$. Since Case 1 can be viewed as a
special case of Case 2 in which the value of $\gamma$ is zero, the detection problem
is closely related to the problem of estimating $\gamma$, which we discussed in the
chapter dealing with the best linear unbiased estimator, the BLUE.

We shall assume throughout that the entries of $z$ correspond to random
variables with means equal to zero. What the variances are and whether or
not these random variables are mutually correlated will be discussed next.
In all cases we shall assume that this information has been determined
previously and is available to us in the form of the covariance matrix
$Q = E(zz^\dagger)$ of the vector $z$; the symbol $E$ denotes expected value, so the entries
of $Q$ are the quantities $Q_{mn} = E(z_m \overline{z_n})$. The diagonal entries of $Q$ are
$Q_{nn} = \sigma_n^2$, the variance of $z_n$. As in Chapter 17, we assume here that $Q$ is
invertible, which is the typical case.

Note that we have adopted the common practice of using the same 
symbols, zn, when speaking about the random variables and about the 
specific values of these random variables that are present in our data. The 
context should make it clear to which we are referring. 

In Case 2 we say that the signal power is equal to
$|\gamma|^2 \frac{1}{N}\sum_{n=1}^{N}|s_n|^2 = \frac{1}{N}|\gamma|^2 s^\dagger s$
and the noise power is $\frac{1}{N}\sum_{n=1}^{N}\sigma_n^2 = \frac{1}{N}\mathrm{tr}(Q)$, where $\mathrm{tr}(Q)$ is the
trace of the matrix $Q$, that is, the sum of its diagonal terms; therefore, the
noise power is the average of the variances $\sigma_n^2$. The input signal-to-noise
ratio ($SNR_{in}$) is the ratio of the signal power to that of the noise, prior to
processing the data; that is,

$$SNR_{in} = \frac{1}{N}|\gamma|^2 s^\dagger s \Big/ \frac{1}{N}\mathrm{tr}(Q) = |\gamma|^2 s^\dagger s/\mathrm{tr}(Q).$$
18.3 Optimal Linear Filtering for Detection


In each case to be considered next, our detector will take the form of a
linear estimate of $\gamma$; that is, we shall compute the estimate $\hat{\gamma}$ given by

$$\hat{\gamma} = \sum_{n=1}^{N}\overline{b_n}\,x_n = b^\dagger x,$$

where $b = (b_1, b_2, \ldots, b_N)^T$ is a vector to be determined. The objective is to
use what we know about the situation to select the optimal $b$, which will
depend on $s$ and $Q$.

For any given vector $b$, the quantity

$$\hat{\gamma} = b^\dagger x = \gamma b^\dagger s + b^\dagger z$$

is a random variable whose mean value is equal to $\gamma b^\dagger s$ and whose variance
is

$$\mathrm{var}(\hat{\gamma}) = E(|b^\dagger z|^2) = E(b^\dagger zz^\dagger b) = b^\dagger E(zz^\dagger)b = b^\dagger Qb.$$

Therefore, the output signal-to-noise ratio ($SNR_{out}$) is defined as

$$SNR_{out} = |\gamma b^\dagger s|^2/b^\dagger Qb.$$

The advantage we obtain from processing the data is called the gain associated
with $b$ and is defined to be the ratio of the $SNR_{out}$ to $SNR_{in}$; that
is,

$$\mathrm{gain}(b) = \frac{|\gamma b^\dagger s|^2/(b^\dagger Qb)}{|\gamma|^2(s^\dagger s)/\mathrm{tr}(Q)} = \frac{|b^\dagger s|^2\,\mathrm{tr}(Q)}{(b^\dagger Qb)(s^\dagger s)}.$$

The best $b$ to use will be the one for which gain($b$) is the largest. So,
ignoring the terms in the gain formula that do not involve $b$, we see that
the problem becomes maximize $\frac{|b^\dagger s|^2}{b^\dagger Qb}$, for fixed signal vector $s$ and fixed
noise covariance matrix $Q$.
The Cauchy Inequality plays a major role in optimal filtering and detection.


Cauchy’s Inequality: For any vectors $a$ and $b$ we have

$$|a^\dagger b|^2 \le (a^\dagger a)(b^\dagger b),$$

with equality if and only if $a$ is proportional to $b$; that is, there is a scalar
$\beta$ such that $b = \beta a$.
Ex. 18.1 Use Cauchy’s Inequality to show that, for any fixed vector $a$, the
choice $b = \beta a$ maximizes the quantity $|b^\dagger a|^2/b^\dagger b$, for any constant $\beta$.


Ex. 18.2 Use the definition of the covariance matrix $Q$ to show that $Q$
is Hermitian and that, for any vector $y$, $y^\dagger Qy \ge 0$. Therefore, $Q$ is a
nonnegative definite matrix and, using its eigenvector decomposition, can
be written as $Q = CC^\dagger$, for some invertible square matrix $C$.


Ex. 18.3 Consider now the problem of maximizing $|b^\dagger s|^2/b^\dagger Qb$. Using
the two previous exercises, show that the solution is $b = \beta Q^{-1}s$, for some
arbitrary constant $\beta$.


We can now use the results of these exercises to continue our discussion.
We choose the constant $\beta = 1/(s^\dagger Q^{-1}s)$ so that the optimal $b$ has $b^\dagger s = 1$;
that is, the optimal filter $b$ is

$$b = (1/(s^\dagger Q^{-1}s))Q^{-1}s,$$

and the optimal estimate of $\gamma$ is

$$\hat{\gamma} = b^\dagger x = (1/(s^\dagger Q^{-1}s))(s^\dagger Q^{-1}x).$$

The mean of the random variable $\hat{\gamma}$ is equal to $\gamma b^\dagger s = \gamma$, and the variance is
equal to $1/(s^\dagger Q^{-1}s)$. Therefore, the output signal power is $|\gamma|^2$, the output
noise power is $1/(s^\dagger Q^{-1}s)$, and so the output signal-to-noise ratio ($SNR_{out}$)
is

$$SNR_{out} = |\gamma|^2(s^\dagger Q^{-1}s).$$

The gain associated with the optimal vector $b$ is then

$$\text{maximum gain} = \frac{(s^\dagger Q^{-1}s)\,\mathrm{tr}(Q)}{(s^\dagger s)}.$$

The calculation of the vector $C^{-1}x$ is sometimes called prewhitening since
$C^{-1}x = \gamma C^{-1}s + C^{-1}z$ and the new noise vector, $C^{-1}z$, has the identity
matrix for its covariance matrix. The new signal vector is $C^{-1}s$. The
filtering operation that gives $\hat{\gamma} = b^\dagger x$ can be written as

$$\hat{\gamma} = (1/(s^\dagger Q^{-1}s))(C^{-1}s)^\dagger C^{-1}x;$$

the term $(C^{-1}s)^\dagger C^{-1}x$ is described by saying that we prewhiten, then do
a matched filter. Now we consider some special cases of noise.
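
The optimal filter and its gain can be computed directly. The sketch below assumes NumPy; the function names `optimal_filter` and `gain`, the sinusoidal signal vector, and the exponentially decaying covariance are illustrative assumptions. It compares the optimal gain with the gain of the plain matched filter $b = s$ applied without prewhitening.

```python
import numpy as np

def optimal_filter(s, Q):
    """b = Q^{-1} s / (s^† Q^{-1} s), normalized so that b^† s = 1."""
    Qinv_s = np.linalg.solve(Q, s)
    return Qinv_s / (s.conj() @ Qinv_s)

def gain(b, s, Q):
    """gain(b) = |b^† s|^2 tr(Q) / ((b^† Q b)(s^† s))."""
    num = abs(b.conj() @ s) ** 2 * np.trace(Q).real
    den = (b.conj() @ Q @ b).real * (s.conj() @ s).real
    return num / den

N = 8
n = np.arange(1, N + 1)
s = np.exp(-1j * n * 0.7)                          # sinusoidal signal vector (assumption)
Q = 0.9 ** np.abs(np.subtract.outer(n, n))         # illustrative correlated-noise covariance

b_opt = optimal_filter(s, Q)
gain_opt = gain(b_opt, s, Q)
gain_matched = gain(s, s, Q)                       # matched filter without prewhitening
gain_white = gain(optimal_filter(s, np.eye(N)), s, np.eye(N))
```

In the white-noise case the computed gain equals the number of samples, in agreement with the discussion that follows.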
18.4 The Case of White Noise


We say that the noise is white noise if the covariance matrix is $Q = \sigma^2 I$,
where $I$ denotes the identity matrix, whose entries on the main diagonal
have the value 1 and the other entries have the value 0, and $\sigma > 0$ is
the common standard deviation of the $z_n$. This means that the $z_n$ are
mutually uncorrelated (independent, in the Gaussian case) and share a
common variance.

In this case the optimal vector $b$ is $b = \frac{1}{s^\dagger s}s$ and the gain is $N$. Notice
that $\hat{\gamma}$ now involves only a matched filter. We consider now some special
cases of the signal vectors $s$.


18.4.1 Constant Signal
Suppose that the vector $s$ is constant; that is, $s = \mathbf{1} = (1,1,\ldots,1)^T$.
Then, we have

$$\hat{\gamma} = \frac{1}{N}\sum_{n=1}^{N}x_n.$$

This is the same result we found in our discussion of the BLUE, when we
estimated the mean value and the noise was white.


18.4.2 Sinusoidal Signal, Frequency Known
Suppose that

$$s = e(\omega_0) = (\exp(-i\omega_0), \exp(-2i\omega_0), \ldots, \exp(-Ni\omega_0))^T,$$

where $\omega_0$ denotes a known frequency in $[-\pi,\pi)$. Then, $b = \frac{1}{N}e(\omega_0)$ and

$$\hat{\gamma} = \frac{1}{N}\sum_{n=1}^{N}x_n\exp(in\omega_0);$$

so, we see yet another occurrence of the DFT.


18.4.3 Sinusoidal Signal, Frequency Unknown


If we do not know the value of the signal frequency $\omega_0$, a reasonable
thing to do is to calculate the $\hat{\gamma}$ for each (actually, finitely many) of the
possible frequencies within $[-\pi,\pi)$ and base the detection decision on the
largest value; that is, we calculate the DFT as a function of the variable
$\omega$. If there is only a single $\omega_0$ for which there is a sinusoidal signal present
in the data, the values of $\hat{\gamma}$ obtained at frequencies other than $\omega_0$ provide
estimates of the noise power $\sigma^2$, against which the value of $\hat{\gamma}$ for $\omega_0$ can be
compared.
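
The frequency scan just described might look like the following sketch, which assumes NumPy. The function name `matched_filter_scan`, the simulated data, the amplitude, and the grid of 256 candidate frequencies are illustrative assumptions.

```python
import numpy as np

def matched_filter_scan(x, omegas):
    """Return (1/N) sum_n x_n exp(i n omega), n = 1..N, for each omega in omegas."""
    N = len(x)
    n = np.arange(1, N + 1)
    return np.array([np.sum(x * np.exp(1j * n * w)) / N for w in omegas])

rng = np.random.default_rng(5)
N, omega_0, gamma = 64, 0.8, 2.0
n = np.arange(1, N + 1)
noise = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
x = gamma * np.exp(-1j * n * omega_0) + noise      # signal in white noise

omegas = np.linspace(-np.pi, np.pi, 256, endpoint=False)
gamma_hat = matched_filter_scan(x, omegas)
omega_peak = omegas[np.argmax(np.abs(gamma_hat))]  # frequency with the largest response
```

In practice the scan over a uniform grid would usually be computed with a zero-padded FFT rather than an explicit loop; the loop is kept here to mirror the formula.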


18.5 The Case of Correlated Noise 


We say that the noise is correlated if the covariance matrix $Q$ is not a
multiple of the identity matrix. This means either that the $z_n$ are mutually
correlated (dependent, in the Gaussian case) or that they are uncorrelated,
but have different variances.

In this case, as we saw previously, the optimal vector $b$ is

$$b = \frac{1}{(s^\dagger Q^{-1}s)}Q^{-1}s$$

and the gain is

$$\text{maximum gain} = \frac{(s^\dagger Q^{-1}s)\,\mathrm{tr}(Q)}{(s^\dagger s)}.$$

How large or small the gain is depends on how the signal vector s relates 
to the matrix Q. 

For sinusoidal signals, the quantity $s^\dagger s$ is the same, for all values of the
parameter $\omega$; this is not always the case, however. In passive detection of
sources in acoustic array processing, for example, the signal vectors arise
from models of the acoustic medium involved. For far-field sources in an
(acoustically) isotropic deep ocean, planewave models for $s$ will have the
property that $s^\dagger s$ does not change with source location. However, for
near-field or shallow-water environments, this is usually no longer the case.

It follows from Exercise 18.3 that the quantity $\frac{s^\dagger Q^{-1}s}{s^\dagger s}$ achieves its
maximum value when $s$ is an eigenvector of $Q$ associated with its smallest
eigenvalue, $\lambda_N$; in this case, we are saying that the signal vector does not look
very much like a typical noise vector. The maximum gain is then $\lambda_N^{-1}\mathrm{tr}(Q)$.
Since $\mathrm{tr}(Q)$ equals the sum of its eigenvalues, multiplying by $\mathrm{tr}(Q)$ serves
to normalize the gain, so that we cannot get larger gain simply by having
all the eigenvalues of $Q$ small.

On the other hand, if $s$ should be an eigenvector of $Q$ associated with
its largest eigenvalue, say $\lambda_1$, then the maximum gain is $\lambda_1^{-1}\mathrm{tr}(Q)$. If the
noise is signal-like, that is, has one dominant eigenvalue, then $\mathrm{tr}(Q)$ is
approximately $\lambda_1$ and the maximum gain is around one, so we have lost
the maximum gain of $N$ we were able to get in the white-noise case. This
makes sense, in that it says that we cannot significantly improve our ability
to discriminate between signal and noise by taking more samples, if the
signal and noise are very similar.


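18.5.1 Constant Signal with Unequal-Variance
Uncorrelated Noise


Suppose that the vector $s$ is constant; that is, $s = \mathbf{1} = (1,1,\ldots,1)^T$.
Suppose also that the noise covariance matrix is $Q = \mathrm{diag}\{\sigma_1^2,\ldots,\sigma_N^2\}$.
In this case the optimal vector $b$ has entries

$$b_m = \frac{1}{\left(\sum_{n=1}^{N}\sigma_n^{-2}\right)}\sigma_m^{-2},$$

for $m = 1,\ldots,N$, and we have

$$\hat{\gamma} = \frac{1}{\left(\sum_{n=1}^{N}\sigma_n^{-2}\right)}\sum_{m=1}^{N}\sigma_m^{-2}x_m.$$

This is the BLUE estimate of $\gamma$ in this case.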
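
This estimate is an inverse-variance weighted average of the data, which the following minimal sketch makes explicit. It assumes NumPy, and the function name `inverse_variance_mean` is a hypothetical choice.

```python
import numpy as np

def inverse_variance_mean(x, variances):
    """Weighted average of x with weights proportional to 1 / variance."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return np.sum(w * np.asarray(x)) / np.sum(w)

equal = inverse_variance_mean([1.0, 3.0], [1.0, 1.0])      # equal variances: plain average
unequal = inverse_variance_mean([1.0, 3.0], [1.0, 3.0])    # noisier sample counts for less
```

With equal variances the result is the ordinary mean, 2.0; when the second sample has three times the variance of the first, the estimate moves toward the first sample and becomes 1.5.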


18.5.2 Sinusoidal Signal, Frequency Known, in Correlated 
Noise 


Suppose that

$$s = e(\omega_0) = (\exp(-i\omega_0), \exp(-2i\omega_0), \ldots, \exp(-Ni\omega_0))^T,$$

where $\omega_0$ denotes a known frequency in $[-\pi,\pi)$. In this case the optimal
vector $b$ is

$$b = \frac{1}{e(\omega_0)^\dagger Q^{-1}e(\omega_0)}Q^{-1}e(\omega_0)$$

and the gain is

$$\text{maximum gain} = \frac{1}{N}[e(\omega_0)^\dagger Q^{-1}e(\omega_0)]\,\mathrm{tr}(Q).$$

How large or small the gain is depends on the quantity $q(\omega_0)$, where

$$q(\omega) = e(\omega)^\dagger Q^{-1}e(\omega).$$

The function $1/q(\omega)$ can be viewed as a sort of noise power spectrum,
describing how the noise power appears when decomposed over the various
frequencies in $[-\pi,\pi)$. The maximum gain will be large if this noise power
spectrum is relatively small near $\omega = \omega_0$; however, when the noise is similar
to the signal, that is, when the noise power spectrum is relatively large
near $\omega = \omega_0$, the maximum gain can be small. In this case the noise power
spectrum plays a role analogous to that played by the eigenvalues of $Q$
earlier.

To see more clearly why it is that the function $1/q(\omega)$ can be viewed
as a sort of noise power spectrum, consider what we get when we apply
the optimal filter associated with $\omega$ to data containing only noise. The
average output should tell us how much power there is in the component of
the noise that resembles $e(\omega)$; this is essentially what is meant by a noise
power spectrum. The result is $b^\dagger z = (1/q(\omega))e(\omega)^\dagger Q^{-1}z$. The expected
value of $|b^\dagger z|^2$ is then $1/q(\omega)$.


18.5.3 Sinusoidal Signal, Frequency Unknown, in 
Correlated Noise 


Again, if we do not know the value of the signal frequency $\omega_0$, a reasonable
thing to do is to calculate the $\hat{\gamma}$ for each (actually, finitely many)
of the possible frequencies within $[-\pi,\pi)$ and base the detection decision
on the largest value. For each $\omega$ the corresponding value of $\hat{\gamma}$ is

$$\hat{\gamma}(\omega) = [1/(e(\omega)^\dagger Q^{-1}e(\omega))]\sum_{n=1}^{N}a_n\exp(in\omega),$$

where $a = (a_1, a_2, \ldots, a_N)^T$ satisfies the linear system $Qa = x$ or $a = Q^{-1}x$.
It is interesting to note the similarity between this estimation procedure and
the PDFT discussed earlier; to see the connection, view $[1/(e(\omega)^\dagger Q^{-1}e(\omega))]$
in the role of $P(\omega)$ and $Q$ its corresponding matrix of Fourier-transform values.
The analogy breaks down when we notice that $Q$ need not be Toeplitz,
as in the PDFT case; however, the similarity is intriguing.


18.6 Capon’s Data-Adaptive Method 


When the noise covariance matrix Q is not available, perhaps because 
we cannot observe the background noise in the absence of any signals that 
may also be present, we may use the signal-plus-noise covariance matrix R 
in place of Q. 


Ex. 18.4 Show that for

$$R = |\gamma|^2 ss^\dagger + Q$$

maximizing the ratio

$$|b^\dagger s|^2/b^\dagger Rb$$

is equivalent to maximizing the ratio

$$|b^\dagger s|^2/b^\dagger Qb.$$


In [49] Capon offered a high-resolution method for detecting and resolving
sinusoidal signals with unknown frequencies in noise. His estimator
has the form

$$1/e(\omega)^\dagger R^{-1}e(\omega).$$

The idea here is to fix an arbitrary $\omega$, and then to find the vector $b(\omega)$ that
minimizes $b(\omega)^\dagger Rb(\omega)$, subject to $b(\omega)^\dagger e(\omega) = 1$. The vector $b(\omega)$ turns
out to be

$$b(\omega) = \frac{1}{e(\omega)^\dagger R^{-1}e(\omega)}R^{-1}e(\omega).$$

Now we allow $\omega$ to vary and compute the expected output of the filter $b(\omega)$,
operating on the signal plus noise input. This expected output is then

$$1/e(\omega)^\dagger R^{-1}e(\omega).$$


The reason that this estimator resolves closely spaced delta functions better 
than linear methods such as the DFT is that, when w is fixed, we obtain an 
optimal filter using R as the noise covariance matrix, which then includes 
all sinusoids not at the frequency w in the “noise” component. This is 
actually a good thing, since, when we are looking at a frequency w that 
does not correspond to a frequency actually present in the data, we want 
the sinusoidal components present at nearby frequencies to be filtered out, 
to improve resolution. We lose resolution of two nearby peaks in estimators 
like the DFT when the estimator gives a larger value between two actual 
peaks than it does at the peaks themselves. Methods such as Capon’s reduce 
the estimator’s value between the two peaks. 
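
The resolution behavior described above can be illustrated numerically. The sketch below assumes NumPy and an idealized covariance matrix built from two unit-power sinusoids at nearby frequencies plus white noise; the function names `steering`, `capon` and `bartlett`, the frequencies, and the noise level are all illustrative assumptions. The DFT-type quantity used for comparison is one common linear estimator and is chosen here only as a stand-in.

```python
import numpy as np

def steering(omega, N):
    """e(omega) = (exp(-i omega), ..., exp(-i N omega))^T."""
    return np.exp(-1j * np.arange(1, N + 1) * omega)

def capon(R, omega):
    """Capon's estimator 1 / (e(omega)^† R^{-1} e(omega))."""
    e = steering(omega, R.shape[0])
    return 1.0 / np.real(e.conj() @ np.linalg.solve(R, e))

def bartlett(R, omega):
    """DFT-type (linear) estimate e^† R e / N^2, for comparison."""
    N = R.shape[0]
    e = steering(omega, N)
    return np.real(e.conj() @ R @ e) / N**2

# Two sinusoids closer together than 2*pi/N, in white noise (idealized covariance).
N, w1, w2, sigma2 = 16, 1.0, 1.2, 0.01
e1, e2 = steering(w1, N), steering(w2, N)
R = np.outer(e1, e1.conj()) + np.outer(e2, e2.conj()) + sigma2 * np.eye(N)
mid = 0.5 * (w1 + w2)
```

For these values the DFT-type estimate is larger at the midpoint `mid` than at either true frequency, so the two peaks merge, while Capon's estimate dips sharply at the midpoint and is close to the true power at `w1` and `w2`.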
19.1 Chapter Summary


Many methods for analyzing measured signals are based on the idea 
of matching the data against various potential signals to see which ones 
match best. The role of inner products in this matching approach is the 
topic of this chapter. 


19.2 Cauchy’s Inequality 


The matching is done using the complex dot product, $e_\omega^\dagger d$. In the ideal
case this dot product is large, for those values of $\omega$ that correspond to
an actual component of the signal; otherwise it is small. Why this should
be the case is explained by the Cauchy-Schwarz Inequality (or sometimes,
depending on the context, just Cauchy’s Inequality, just Schwarz’s Inequality, or, in
the Russian literature, Bunyakovsky’s Inequality). The proof of Cauchy’s
Inequality rests on four basic properties of the complex dot product. These
properties can then be used to obtain the more general notion of an inner
product.


19.3 The Complex Vector Dot Product 


Let u = (a, b) and v = (c, d) be two vectors in two-dimensional space.
Let u make the angle $\alpha > 0$ with the positive x-axis and v the angle $\beta > 0$.
Let $\|u\| = \sqrt{a^2 + b^2}$ denote the length of the vector u. Then $a = \|u\|\cos\alpha$,
$b = \|u\|\sin\alpha$, $c = \|v\|\cos\beta$ and $d = \|v\|\sin\beta$. So

$$u \cdot v = ac + bd = \|u\|\,\|v\|(\cos\alpha\cos\beta + \sin\alpha\sin\beta) = \|u\|\,\|v\|\cos(\alpha - \beta).$$

Therefore, we have

$$u \cdot v = \|u\|\,\|v\|\cos\theta, \tag{19.1}$$

where $\theta = \alpha - \beta$ is the angle between u and v. Cauchy’s Inequality is

$$|u \cdot v| \le \|u\|\,\|v\|,$$

with equality if and only if u and v are parallel.

Cauchy’s Inequality extends to vectors of any size with complex entries.
For example, the complex M-dimensional vectors $e_\omega$ and $e_\theta$ defined earlier
both have length equal to $\sqrt{M}$ and

$$|e_\omega^\dagger e_\theta| \le M,$$

with equality if and only if $\omega$ and $\theta$ differ by an integer multiple of $\pi$.
From Equation (19.1) we know that the dot product $u \cdot v$ is zero if and
only if the angle between these two vectors is a right angle; we say then
that u and v are mutually orthogonal. The idea of using the dot product
to measure how similar two vectors are is called matched filtering; it is a
popular method in signal detection and estimation of parameters.
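A small numerical sketch of matched filtering may help here. It assumes, purely for illustration, that $e_\omega$ has entries $e^{im\omega}$ for m = 0, ..., M - 1 (the definition given earlier in the book may differ), and the names `cdot` and `e_vec` are hypothetical helpers, not notation from the text. The data vector contains a single noise-free sinusoid, and the response $|e_\omega^\dagger d|$ is scanned over a grid of candidate frequencies.

```python
import cmath

def cdot(u, v):
    """Complex dot product u . v = sum_m u_m * conj(v_m), that is, v^dagger u."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def e_vec(omega, M):
    """Assumed form of e_omega: entries exp(i*m*omega), m = 0, ..., M-1."""
    return [cmath.exp(1j * m * omega) for m in range(M)]

M = 16
omega0 = 0.9                                  # frequency present in the data
d = [2.0 * x for x in e_vec(omega0, M)]       # noise-free data: one sinusoid

# Matched-filter response |e_omega^dagger d| on a grid of candidate frequencies.
grid = [k * 0.01 for k in range(315)]
response = [abs(cdot(d, e_vec(w, M))) for w in grid]
best = grid[response.index(max(response))]    # grid frequency with the largest response
```

By Cauchy’s Inequality no response can exceed $\|d\|\,\|e_\omega\| = 2\sqrt{M}\sqrt{M} = 2M$, and in this sketch the largest response occurs at the grid point matching the true frequency.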


Proof of Cauchy’s Inequality: To prove Cauchy’s Inequality for the
complex vector dot product, we write $u \cdot v = |u \cdot v|e^{i\theta}$. Let t be a real
variable and consider

$$\begin{aligned}
0 \le \|e^{-i\theta}u - tv\|^2 &= (e^{-i\theta}u - tv) \cdot (e^{-i\theta}u - tv) \\
&= \|u\|^2 - t[(e^{-i\theta}u) \cdot v + v \cdot (e^{-i\theta}u)] + t^2\|v\|^2 \\
&= \|u\|^2 - t[(e^{-i\theta}u) \cdot v + \overline{(e^{-i\theta}u) \cdot v}] + t^2\|v\|^2 \\
&= \|u\|^2 - 2\mathrm{Re}(te^{-i\theta}(u \cdot v)) + t^2\|v\|^2 \\
&= \|u\|^2 - 2\mathrm{Re}(t|u \cdot v|) + t^2\|v\|^2 \\
&= \|u\|^2 - 2t|u \cdot v| + t^2\|v\|^2.
\end{aligned}$$

This is a nonnegative quadratic polynomial in the variable t, so it cannot
have two distinct real roots. Therefore, the discriminant
$4|u \cdot v|^2 - 4\|v\|^2\|u\|^2$ must be non-positive; that is,
$|u \cdot v|^2 \le \|u\|^2\|v\|^2$. This is Cauchy’s Inequality.


Ex. 19.1 Use Cauchy’s Inequality to show that

$$\|u + v\| \le \|u\| + \|v\|;$$

this is called the triangle inequality.


A careful examination of the proof just presented shows that we did not 
explicitly use the definition of the complex vector dot product, but only 
some of its properties. This suggested to mathematicians the possibility of 
abstracting these properties and using them to define a more general con- 
cept, an inner product, between objects more general than complex vectors, 
such as infinite sequences, random variables, and matrices. Such an inner 
product can then be used to define the norm of these objects and thereby a 
distance between such objects. Once we have an inner product defined, we 
also have available the notions of orthogonality and best approximation. 
We shall address all of these topics shortly. 


19.4 Orthogonality 


Consider the problem of writing the two-dimensional real vector (3, -2)
as a linear combination of the vectors (1, 1) and (1, -1); that is, we want
to find constants a and b so that (3, -2) = a(1, 1) + b(1, -1). One way to
do this, of course, is to compare the components: 3 = a + b and -2 = a - b;
we can then solve this simple system for a and b. In higher dimensions
this way of doing it becomes harder, however. A second way is to make use
of the dot product and orthogonality.

The dot product of two vectors (x, y) and (w, z) in $R^2$ is
$(x, y) \cdot (w, z) = xw + yz$. If the dot product is zero then the vectors are
said to be orthogonal; the two vectors (1, 1) and (1, -1) are orthogonal. We
take the dot product of both sides of (3, -2) = a(1, 1) + b(1, -1) with (1, 1)
to get

$$1 = (3, -2) \cdot (1, 1) = a(1, 1) \cdot (1, 1) + b(1, -1) \cdot (1, 1) = a(1, 1) \cdot (1, 1) + 0 = 2a,$$

so we see that a = 1/2. Similarly, taking the dot product of both sides with
(1, -1) gives

$$5 = (3, -2) \cdot (1, -1) = a(1, 1) \cdot (1, -1) + b(1, -1) \cdot (1, -1) = 2b,$$

so b = 5/2. Therefore, (3, -2) = (1/2)(1, 1) + (5/2)(1, -1). The beauty of this
approach is that it does not get much harder as we go to higher dimensions.
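The calculation just described can be sketched in a few lines. The function name `expand` is a hypothetical helper, and the sketch assumes real vectors and a basis whose members are mutually orthogonal; it is meant only to show that each coefficient comes from one dot product, whatever the dimension.

```python
def dot(u, v):
    """Real dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def expand(x, basis):
    """Coefficients of x in a mutually orthogonal basis:
    c_k = (x . b_k) / (b_k . b_k)."""
    return [dot(x, b) / dot(b, b) for b in basis]

# The example from the text: (3, -2) = a(1, 1) + b(1, -1).
coeffs = expand((3, -2), [(1, 1), (1, -1)])
```

Here `coeffs` holds the values a = 1/2 and b = 5/2 found above; no system of equations has to be solved.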


Since the cosine of the angle $\theta$ between vectors u and v is

$$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|},$$

where $\|u\|^2 = u \cdot u$, the projection of vector v onto the line through the
origin parallel to u is

$$\mathrm{Proj}_u(v) = \frac{u \cdot v}{u \cdot u}\,u.$$

Therefore, the vector v can be written as

$$v = \mathrm{Proj}_u(v) + (v - \mathrm{Proj}_u(v)),$$

where the first term on the right is parallel to u and the second one is
orthogonal to u.

How do we find vectors that are mutually orthogonal? Suppose we begin
with (1, 1). Take a second vector, say (1, 2), that is not parallel to (1, 1) and
write it as we did v earlier, that is, as a sum of two vectors, one parallel
to (1, 1) and the second orthogonal to (1, 1). The projection of (1, 2) onto
the line parallel to (1, 1) passing through the origin is

$$\frac{(1, 1) \cdot (1, 2)}{(1, 1) \cdot (1, 1)}\,(1, 1) = \frac{3}{2}(1, 1) = \left(\frac{3}{2}, \frac{3}{2}\right),$$

so

$$(1, 2) = \left(\frac{3}{2}, \frac{3}{2}\right) + \left(-\frac{1}{2}, \frac{1}{2}\right).$$

The vectors $(-\frac{1}{2}, \frac{1}{2}) = -\frac{1}{2}(1, -1)$ and, therefore, (1, -1) are then
orthogonal to (1, 1). This approach is the basis for the Gram-Schmidt method
for constructing a set of mutually orthogonal vectors.
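The projection step above extends directly to more than two vectors. The following is a minimal sketch for real vectors; the name `gram_schmidt` and the tolerance used to discard (nearly) dependent vectors are illustrative choices, not part of the text.

```python
def dot(u, v):
    """Real dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-12):
    """Return mutually orthogonal vectors with the same span (real case).

    Each new vector has its projections onto the vectors already found
    subtracted off; vectors that reduce to (almost) zero are dropped.
    """
    ortho = []
    for v in vectors:
        w = [float(x) for x in v]
        for u in ortho:
            c = dot(w, u) / dot(u, u)              # coefficient of the projection onto u
            w = [wi - c * ui for wi, ui in zip(w, u)]
        if any(abs(wi) > tol for wi in w):
            ortho.append(w)
    return ortho

# The example from the text: start with (1, 1), then orthogonalize (1, 2).
basis = gram_schmidt([(1, 1), (1, 2)])
```

For this input the second output vector is (-1/2, 1/2), the orthogonal part of (1, 2) computed by hand above.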


Ex. 19.2 Use the Gram-Schmidt approach to find a third vector in $R^3$
orthogonal to both (1, 1, 1) and (1, 0, -1).


Orthogonality is a convenient tool that can be exploited whenever we 
have an inner product defined. 


19.5 Generalizing the Dot Product: Inner Products 


The proof of Cauchy’s Inequality rests not on the actual definition of
the complex vector dot product, but rather on four of its most basic
properties. We use these properties to extend the concept of the complex vector
dot product to that of inner product. Later in this chapter we shall give
several examples of inner products, applied to a variety of mathematical
objects, including infinite sequences, functions, random variables, and
matrices. For now, let us denote our mathematical objects by u and v and
the inner product between them as $\langle u, v \rangle$. The objects will then be said to
be members of an inner-product space. We are interested in inner products
because they provide a notion of orthogonality, which is fundamental to
best approximation and optimal estimation.


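Defining an inner product: The four basic properties that will serve to
define an inner product are:

1. $\langle u, u \rangle \ge 0$, with equality if and only if u = 0;

2. $\langle v, u \rangle = \overline{\langle u, v \rangle}$;

3. $\langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle$;

4. $\langle cu, v \rangle = c\langle u, v \rangle$ for any complex number c.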


The inner product is the basic ingredient in Hilbert space theory. Using the
inner product, we define the norm of u to be

$$\|u\| = \sqrt{\langle u, u \rangle}$$

and the distance between u and v to be $\|u - v\|$.

The Cauchy-Schwarz Inequality: Because these four properties were
all we needed to prove the Cauchy Inequality for the complex vector dot
product, we obtain the same inequality whenever we have an inner product.
This more general inequality is the Cauchy-Schwarz Inequality:

$$|\langle u, v \rangle| \le \sqrt{\langle u, u \rangle}\,\sqrt{\langle v, v \rangle}$$

or

$$|\langle u, v \rangle| \le \|u\|\,\|v\|,$$

with equality if and only if there is a scalar c such that v = cu. We say
that the vectors u and v are orthogonal if $\langle u, v \rangle = 0$.


19.6 Another View of Orthogonality 


We can develop orthogonality and the Cauchy-Schwarz Inequality in
another way. For simplicity, we assume that the inner product is defined
on a real vector space. From the definition of the norm we have

$$\|x + y\|^2 = \langle x + y, x + y \rangle = \|x\|^2 + \|y\|^2 + 2\langle x, y \rangle.$$

We say that Pythagoras’ Theorem holds for $x \ne 0$ and $y \ne 0$ if

$$\|x + y\|^2 = \|x\|^2 + \|y\|^2.$$

Clearly, Pythagoras’ Theorem holds if and only if $\langle x, y \rangle = 0$.
Now, we say that nonzero vectors x and y are orthogonal if

$$\|x + y\| = \|x - y\|.$$

It is an easy exercise to show that $x \ne 0$ and $y \ne 0$ are orthogonal if and
only if $\langle x, y \rangle = 0$ and if and only if Pythagoras’ Theorem holds.

For nonzero x and y, let $p = \gamma y$ be the vector in the span of y for
which

$$\|x - p\| \le \|x - \beta y\|,$$

for all real $\beta$. Minimizing the function

$$f(\beta) = \|x - \beta y\|^2$$

with respect to the variable $\beta$, we find that the optimal $\gamma$ is

$$\gamma = \frac{\langle x, y \rangle}{\|y\|^2}.$$

A simple calculation shows that the vectors x - p and p are orthogonal, so
that, by Pythagoras’ Theorem,

$$\|x\|^2 = \|x - p\|^2 + \|p\|^2.$$

It follows, therefore, that

$$\|x\| \ge \|p\|,$$

and so

$$|\langle x, y \rangle| \le \|x\|\,\|y\|,$$

with equality if and only if x = p. This is the Cauchy-Schwarz Inequality
once again.
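As a quick numerical check of this argument, the sketch below uses the ordinary dot product on $R^3$ as the inner product and two arbitrarily chosen vectors; the helper names `inner`, `norm` and `project` are hypothetical. It computes the best multiple p of y and confirms that x - p is orthogonal to p.

```python
import math

def inner(x, y):
    """Real inner product; here simply the dot product."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

def project(x, y):
    """Best approximation p = gamma*y of x, with gamma = <x, y> / ||y||^2."""
    gamma = inner(x, y) / inner(y, y)
    return [gamma * yi for yi in y]

x = [3.0, 1.0, 2.0]          # illustrative vectors
y = [1.0, 1.0, 0.0]
p = project(x, y)                              # p = 2*y = (2, 2, 0)
r = [xi - pi for xi, pi in zip(x, p)]          # residual x - p = (1, -1, 2)
```

With these vectors, $\|x\|^2 = 14 = 6 + 8 = \|x - p\|^2 + \|p\|^2$, so $\|x\| \ge \|p\|$, in agreement with the inequality just derived.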
For nonzero vectors in $R^2$ or $R^3$ we know that

$$x \cdot y = \|x\|\,\|y\|\cos(\theta),$$

where $\theta$ is the angle between the two vectors when they are viewed as
directed line segments placed so that they have a common starting point.
Using the Cauchy-Schwarz Inequality, we can mimic what happens in $R^2$
and $R^3$ by defining the angle between nonzero vectors in an arbitrary inner
product space to be

$$\theta(x, y) = \arccos\left(\frac{\langle x, y \rangle}{\|x\|\,\|y\|}\right).$$

We turn now to some examples of inner products.


19.7 Examples of Inner Products 


In this section we illustrate the notion of inner product with several 
examples. 


19.7.1 An Inner Product for Infinite Sequences 


Let $u = \{u_n\}$ and $v = \{v_n\}$ be infinite sequences of complex numbers.
The inner product is then

$$\langle u, v \rangle = \sum u_n \overline{v_n},$$

and

$$\|u\| = \sqrt{\sum |u_n|^2}.$$

The sums are assumed to be finite; the index of summation n is singly or
doubly infinite, depending on the context. The Cauchy-Schwarz Inequality
says that

$$\Big|\sum u_n \overline{v_n}\Big| \le \sqrt{\sum |u_n|^2}\,\sqrt{\sum |v_n|^2}.$$


19.7.2 An Inner Product for Functions 


Now suppose that u = f(x) and v = g(x). Then the $L^2$ inner product is

$$\langle u, v \rangle = \int f(x)\overline{g(x)}\,dx$$

and the $L^2$ norm of u is

$$\|u\| = \sqrt{\int |f(x)|^2\,dx}.$$

The integrals are assumed to be finite; the limits of integration depend on
the support of the functions involved. The Cauchy-Schwarz Inequality now
says that

$$\Big|\int f(x)\overline{g(x)}\,dx\Big| \le \sqrt{\int |f(x)|^2\,dx}\,\sqrt{\int |g(x)|^2\,dx}.$$
19.7.3 An Inner Product for Random Variables

Now suppose that u = X and v = Y are random variables. Then,

$$\langle u, v \rangle = E(X\overline{Y})$$

and

$$\|u\| = \sqrt{E(|X|^2)},$$

which is the standard deviation of X if the mean of X is zero. The expected
values are assumed to be finite. The Cauchy-Schwarz Inequality now says
that

$$|E(X\overline{Y})| \le \sqrt{E(|X|^2)}\,\sqrt{E(|Y|^2)}.$$

If E(X) = 0 and E(Y) = 0, the random variables X and Y are orthogonal
if and only if they are uncorrelated.


19.7.4 An Inner Product for Complex Matrices 
Now suppose that u = A and v = B are complex matrices. Then,

$$\langle u, v \rangle = \mathrm{trace}(B^\dagger A)$$

and

$$\|u\| = \sqrt{\mathrm{trace}(A^\dagger A)},$$

where the trace of a square matrix is the sum of the entries on the main
diagonal. This inner product is simply the complex vector dot product
of the vectorized versions of the matrices involved. The Cauchy-Schwarz
Inequality now says that

$$|\mathrm{trace}(B^\dagger A)| \le \sqrt{\mathrm{trace}(A^\dagger A)}\,\sqrt{\mathrm{trace}(B^\dagger B)}.$$


19.7.5 A Weighted Inner Product for Complex Vectors 


Let u and v be complex vectors and let Q be a Hermitian positive-definite
matrix; that is, $Q^\dagger = Q$ and $u^\dagger Q u > 0$ for all nonzero vectors u.
The Q-inner product is then

$$\langle u, v \rangle = v^\dagger Q u$$

and the Q-norm of u is

$$\|u\| = \sqrt{u^\dagger Q u}.$$

We know from the eigenvector decomposition of Q that $Q = C^\dagger C$ for some
matrix C. Therefore, the inner product is simply the complex vector dot
product of the vectors Cu and Cv. The Cauchy-Schwarz Inequality says
that

$$|v^\dagger Q u| \le \sqrt{u^\dagger Q u}\,\sqrt{v^\dagger Q v}.$$


19.7.6 A Weighted Inner Product for Functions 
Now suppose that u = f(x), v = g(x), and w(x) > 0. Then define

$$\langle u, v \rangle = \int f(x)\overline{g(x)}\,w(x)\,dx$$

and

$$\|u\| = \sqrt{\int |f(x)|^2 w(x)\,dx}.$$

The integrals are assumed to be finite; the limits of integration depend on
the support of the functions involved. This inner product is simply the $L^2$
inner product of the functions $f(x)\sqrt{w(x)}$ and $g(x)\sqrt{w(x)}$. The
Cauchy-Schwarz Inequality now says that

$$\Big|\int f(x)\overline{g(x)}\,w(x)\,dx\Big| \le \sqrt{\int |f(x)|^2 w(x)\,dx}\,\sqrt{\int |g(x)|^2 w(x)\,dx}.$$


Once we have an inner product defined, we can speak about orthogonal- 
ity and best approximation. Important in that regard is the orthogonality 
principle. 


19.8 The Orthogonality Principle 


Imagine that you are standing and looking down at the floor. The point 
B on the floor that is closest to N, the tip of your nose, is the unique 
point on the floor such that the vector from B to any other point A on the 
floor is perpendicular to the vector from N to B; that is, $\langle BN, BA \rangle = 0$.
This is a simple illustration of the orthogonality principle. Whenever we
have an inner product defined we can speak of orthogonality and apply the 
orthogonality principle to find best approximations. 


The orthogonality principle: Let u and $v^1, \ldots, v^N$ be members of an
inner-product space. For all choices of scalars $a_1, \ldots, a_N$, we can compute
the distance from u to the member $a_1 v^1 + \ldots + a_N v^N$. Then, we minimize
this distance over all choices of the scalars; let $b_1, \ldots, b_N$ be this best choice.
The orthogonality principle tells us that the member $u - (b_1 v^1 + \ldots + b_N v^N)$
is orthogonal to the member $(a_1 v^1 + \ldots + a_N v^N) - (b_1 v^1 + \ldots + b_N v^N)$, that
is,

$$\langle u - (b_1 v^1 + \ldots + b_N v^N),\ (a_1 v^1 + \ldots + a_N v^N) - (b_1 v^1 + \ldots + b_N v^N) \rangle = 0,$$

for every choice of scalars $a_n$. We can then use the orthogonality principle
to find the best choice $b_1, \ldots, b_N$.

For each fixed index value j in the set {1, ..., N}, let $a_n = b_n$ if j is not
equal to n and $a_j = b_j + 1$. Then we have

$$0 = \langle u - (b_1 v^1 + \ldots + b_N v^N), v^j \rangle,$$

or

$$\langle u, v^j \rangle = \sum_{n=1}^{N} b_n \langle v^n, v^j \rangle,$$

for each j. The $v^n$ are known, so we can calculate the inner products
$\langle v^n, v^j \rangle$ and solve this system of equations for the best $b_n$.

We shall encounter a number of particular cases of the orthogonality 
principle in subsequent chapters. The example of the least-squares solution 
of a system of linear equations provides a good example of the use of this 
principle. 


The least-squares solution: Let Va = u be a system of M linear equations
in N unknowns. For n = 1, ..., N let $v^n$ be the nth column of the
matrix V. For any choice of the vector a with entries $a_n$, n = 1, ..., N, the
vector Va is

$$Va = \sum_{n=1}^{N} a_n v^n.$$

Solving Va = u amounts to representing the vector u as a linear combination
of the columns of V.

If there is no solution of Va = u then we can look for the best choice of
coefficients so as to minimize the distance $\|u - (a_1 v^1 + \ldots + a_N v^N)\|$. The
matrix with entries $\langle v^n, v^j \rangle$ is $V^\dagger V$, and the vector with entries $\langle u, v^j \rangle$ is
$V^\dagger u$. According to the orthogonality principle, we must solve the system
of equations $V^\dagger u = V^\dagger V a$, which leads to the least-squares solution.
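To make the normal equations concrete, here is a minimal sketch for the smallest interesting case: a real matrix V with two columns, so that $V^\dagger V$ is 2 by 2 and can be inverted by hand. The function name `least_squares_two_columns` and the data are illustrative assumptions; a general solver would replace the explicit 2 by 2 formula.

```python
def dot(u, v):
    """Real dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def least_squares_two_columns(v1, v2, u):
    """Solve the 2-by-2 normal equations (V^T V) a = V^T u for V = [v1 v2]."""
    g11, g12, g22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)   # entries of V^T V
    r1, r2 = dot(u, v1), dot(u, v2)                         # entries of V^T u
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("columns are (nearly) linearly dependent")
    a1 = (r1 * g22 - g12 * r2) / det
    a2 = (g11 * r2 - g12 * r1) / det
    return a1, a2

# Three equations, two unknowns, no exact solution (illustrative data).
v1, v2, u = (1.0, 1.0, 1.0), (0.0, 1.0, 2.0), (1.0, 2.0, 2.0)
a1, a2 = least_squares_two_columns(v1, v2, u)
residual = [ui - a1 * x - a2 * y for ui, x, y in zip(u, v1, v2)]
```

The residual u - Va produced this way is orthogonal to both columns of V, which is exactly what the orthogonality principle requires.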


Ex. 19.3 Find polynomial functions f(x), g(x) and h(x) that are orthogonal
in the sense of the $L^2$ inner product on the interval [0, 1] and have
the property that every polynomial of degree two or less can be written as
a linear combination of these three functions.


Ex. 19.4 Show that the functions $e^{inx}$, n an integer, are orthogonal in the
sense of the $L^2$ inner product on the interval $[-\pi, \pi]$. Let f(x) have the
Fourier expansion

$$f(x) = \sum_{n=-\infty}^{\infty} a_n e^{inx}, \quad |x| \le \pi.$$

Use orthogonality to find the coefficients $a_n$.
We have seen that orthogonality can be used to determine the coefficients
in the Fourier series representation of a function. There are other
useful representations in which orthogonality also plays a role; wavelets are
one example. Let f(x) be defined on some closed interval [a, b]. Suppose
that we change the function f(x) to a new function g(x) by altering the
values for x within a small interval, keeping the remaining values the same:
then all of the Fourier coefficients change. Looked at another way, a localized
disturbance in the function f(x) affects all of its Fourier coefficients.
It would be helpful to be able to represent f(x) as a sum of orthogonal
functions in such a way that localized changes in f(x) affect only a small
number of the components in the sum. One way to do this is with wavelets,
as we saw in Chapter 18.
20.1 Chapter Summary


The vector Wiener filter (VWF) is similar to the BLUE and provides
another method for estimating the vector x given noisy measurements z in
$C^J$, where

$$z = Hx + v,$$

with x and v independent random vectors and H a known matrix. We shall
assume throughout this chapter that E(v) = 0 and let $Q = E(vv^\dagger)$.

When the data is a finite vector composed of signal plus noise the vec- 
tor Wiener filter can be used to estimate the signal component, provided 
we know something about the possible signals and possible noises. In the- 
oretical discussion of filtering signal from signal plus noise, it is traditional 
to assume that both components are doubly infinite sequences of random 
variables. In this case the Wiener filter is a convolution filter that operates 
on the input signal plus noise sequence to produce the output estimate of 
the signal-only sequence. The derivation of the Wiener filter is in terms 
of the autocorrelation sequences of the two components, as well as their 
respective power spectra. 
20.2 The Vector Wiener Filter in Estimation


It is common to formulate the VWF in the context of filtering a signal
vector s from signal plus noise. The data is the vector

$$z = s + v,$$

and we want to estimate s. Each entry of our estimate of the vector s will be
a linear combination of the data values; that is, our estimate is $\hat{s} = B^\dagger z$ for
some matrix B to be determined. This B will be called the vector Wiener
filter. To extract the signal from the noise, we must know something about
possible signals and possible noises. We consider several stages of increasing
complexity and correspondence with reality.


20.3 The Simplest Case 


Suppose, initially, and unrealistically, that all signals must have the form
s = au, where a is an unknown scalar and u is a known vector. Suppose
that all noises must have the form v = bw, where b is an unknown scalar
and w is a known vector. Then, to estimate s, we must find a. So long as
$J \ge 2$, we should be able to solve for a and b. We form the two equations

$$u^\dagger z = a\,u^\dagger u + b\,u^\dagger w$$

and

$$w^\dagger z = a\,w^\dagger u + b\,w^\dagger w.$$

This system of two equations in two unknowns will have a unique solution
unless u and w are proportional, in which case we cannot expect to
distinguish signal from noise.


20.4 A More General Case 


We move now to a somewhat more complicated, but still unrealistic,
model. Suppose that all signals must have the form

$$s = \sum_{n=1}^{N} a_n u^n,$$

where the $a_n$ are unknown scalars and the $u^n$ are known linearly independent
vectors. Suppose that all noises must have the form

$$v = \sum_{m=1}^{M} b_m w^m,$$

where the $b_m$ are unknown scalars and $w^m$ are known linearly independent
vectors. Then, to estimate s, we must find the $a_n$. So long as $J \ge N + M$,
we should be able to solve for the unique $a_n$ and $b_m$. However, we usually
do not know a great deal about the signal and the noise, so it is better to
assume that we are in the situation in which the N and M are large and
$J < N + M$.

Let U be the J by N matrix whose nth column is $u^n$ and W the J by
M matrix whose mth column is $w^m$. Let V be the J by N + M matrix
whose first N columns contain U and whose last M columns contain W;
so, V = [U W]. Let c be the N + M by 1 column vector whose first N
entries are the $a_n$ and whose last M entries are the $b_m$. We want to solve
z = Vc.

The system of linear equations z = Vc has too many unknowns when
N + M > J, so we seek the minimum-norm solution. In closed form this
solution is

$$\hat{c} = V^\dagger (VV^\dagger)^{-1} z.$$

The first N entries of $\hat{c}$ are our estimates of the $a_n$. Once we have these,
we estimate the signal itself by multiplying by the matrix U; that is, our
estimate of s is

$$\hat{s} = UU^\dagger (UU^\dagger + WW^\dagger)^{-1} z.$$

The matrix $VV^\dagger = UU^\dagger + WW^\dagger$ involves what we shall call the signal
correlation matrix $UU^\dagger$ and the noise correlation matrix $WW^\dagger$, by analogy
with the statistical terminology.
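A tiny worked instance may make these formulas easier to follow. The sketch below assumes real data with J = 2, N = 2 and M = 1, so that $VV^\dagger$ is 2 by 2 and can be inverted explicitly; the matrices U and W, the data z, and the helper names `matmul`, `transpose` and `inv2` are all illustrative choices.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(A):
    """Inverse of a 2-by-2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

U = [[1.0, 1.0],
     [0.0, 1.0]]            # columns u^1 = (1, 0), u^2 = (1, 1)
W = [[0.0],
     [1.0]]                 # column w^1 = (0, 1)
V = [ru + rw for ru, rw in zip(U, W)]          # V = [U W], 2 by 3
z = [[3.0], [3.0]]                             # data vector

G = inv2(matmul(V, transpose(V)))              # (V V^T)^{-1} = (U U^T + W W^T)^{-1}
c_hat = matmul(transpose(V), matmul(G, z))     # minimum-norm solution of z = V c
s_hat = matmul(matmul(U, transpose(U)), matmul(G, z))   # signal estimate
```

Here $\hat{c} = (1, 2, 1)^T$ reproduces the data exactly, and multiplying its first two entries by U gives the same $\hat{s} = (3, 2)^T$ as the closed-form expression.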

Consider $UU^\dagger$. The matrix $UU^\dagger$ is J by J and the (i, j) entry of $UU^\dagger$
is given by

$$(UU^\dagger)_{ij} = \sum_{n=1}^{N} u_i^n \overline{u_j^n}.$$

The matrix $\frac{1}{N}UU^\dagger$ has for its entries the average, over all the n = 1, ..., N,
of the product of the ith and jth entries of the vectors $u^n$. Therefore, $\frac{1}{N}UU^\dagger$
is statistical information about the signal; it tells us how these products
look, on average, over all members of the family $\{u^n\}$, the ensemble, to use
the statistical word.
20.5 The Stochastic Case


To pass to a more formal statistical framework, we let the coefficient
vectors $a = (a_1, a_2, \ldots, a_N)^T$ and $b = (b_1, b_2, \ldots, b_M)^T$ be independent
random white-noise vectors, both with mean zero and covariance matrices
$E(aa^\dagger) = I$ and $E(bb^\dagger) = I$. Now the matrices $UU^\dagger$ and $WW^\dagger$ are defined
statistically;

$$UU^\dagger = E(ss^\dagger) = R_s$$

and

$$WW^\dagger = E(vv^\dagger) = Q = R_v.$$

The estimate of s is the result of applying the vector Wiener filter to the
vector z and is once again given by

$$\hat{s} = UU^\dagger (UU^\dagger + WW^\dagger)^{-1} z.$$


Ex. 20.1 Apply the vector Wiener filter to the simplest problem discussed
earlier in the chapter on the BLUE; let N = 1 and assume that c is a
random variable with mean zero and variance one. It will help to use the
matrix-inversion identity

$$(Q + uu^\dagger)^{-1} = Q^{-1} - (1 + u^\dagger Q^{-1} u)^{-1} Q^{-1} u u^\dagger Q^{-1}; \tag{20.1}$$

see also Equation (17.10).


20.6 The VWF and the BLUE 


To apply the VWF to the problem considered in the discussion of the
BLUE, let the vector s be Hx. We assume, in addition, that the vector x
is a white-noise vector; that is, $E(xx^\dagger) = \sigma^2 I$. Then, $R_s = \sigma^2 HH^\dagger$.

In the VWF approach we estimate s using

$$\hat{s} = B^\dagger z,$$

where the matrix B is chosen so as to minimize the mean squared error,
$E\|\hat{s} - s\|^2$. This is equivalent to minimizing

$$\mathrm{trace}\,E((B^\dagger z - s)(B^\dagger z - s)^\dagger).$$

Expanding the matrix products and using the previous definitions, we see
that we must minimize

$$\mathrm{trace}\,(B^\dagger (R_s + R_v)B - R_s B - B^\dagger R_s + R_s).$$

Differentiating with respect to the matrix B using Equations (21.15) and
(21.16), we find

$$(R_s + R_v)B - R_s = 0,$$

so that

$$B = (R_s + R_v)^{-1} R_s.$$

Our estimate of the signal component is then

$$\hat{s} = R_s (R_s + R_v)^{-1} z.$$

With s = Hx, our estimate of s is

$$\hat{s} = \sigma^2 HH^\dagger (\sigma^2 HH^\dagger + Q)^{-1} z,$$

and the VWF estimate of x is

$$\hat{x} = \sigma^2 H^\dagger (\sigma^2 HH^\dagger + Q)^{-1} z.$$


How does this estimate relate to the one we got from the BLUE?
The BLUE estimate of x is

$$\hat{x} = (H^\dagger Q^{-1} H)^{-1} H^\dagger Q^{-1} z.$$

From the matrix identity in Equation (17.4), we know that

$$(H^\dagger Q^{-1} H + \sigma^{-2} I)^{-1} H^\dagger Q^{-1} = \sigma^2 H^\dagger (\sigma^2 HH^\dagger + Q)^{-1}.$$

Therefore, the VWF estimate of x is

$$\hat{x} = (H^\dagger Q^{-1} H + \sigma^{-2} I)^{-1} H^\dagger Q^{-1} z.$$

Note that the BLUE estimate is unbiased and unaffected by changes in the
signal strength or the noise strength. In contrast, the VWF is not unbiased
and does depend on the signal-to-noise ratio; that is, it depends on the
ratio $\sigma^2/\mathrm{trace}(Q)$. The BLUE estimate is the limiting case of the VWF
estimate, as the signal-to-noise ratio goes to infinity.
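The limiting behavior can be seen in the simplest special case. The sketch below assumes a single real unknown x, so that H is a column vector h, and white noise with $Q = qI$; under those assumptions the two estimates reduce to the scalar formulas in the comments. The function names and the numbers are illustrative.

```python
def dot(u, v):
    """Real dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def blue_estimate(h, z):
    """BLUE for z = h*x + v with Q = q*I:  x = (h.z) / (h.h), independent of q."""
    return dot(h, z) / dot(h, h)

def vwf_estimate(h, z, sigma2, q):
    """VWF for the same model with E(x^2) = sigma2:
    x = sigma2*(h.z) / (sigma2*(h.h) + q)."""
    return sigma2 * dot(h, z) / (sigma2 * dot(h, h) + q)

h = [1.0, 2.0, 2.0]
z = [1.1, 1.9, 2.2]                      # illustrative noisy measurements

x_blue = blue_estimate(h, z)
x_vwf_low_snr = vwf_estimate(h, z, sigma2=1.0, q=1.0)
x_vwf_high_snr = vwf_estimate(h, z, sigma2=1.0e9, q=1.0)
```

At moderate signal-to-noise ratio the VWF estimate is shrunk toward zero relative to the BLUE, and as $\sigma^2/q$ grows the two estimates agree.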

The BLUE estimates s = Hx by first finding the BLUE estimate of x
and then multiplying it by H to get the estimate of the signal s.


Ex. 20.2 Show that the mean-squared error in the estimation of s is

$$E(\|\hat{s} - s\|^2) = \mathrm{trace}\,(H(H^\dagger Q^{-1} H)^{-1} H^\dagger).$$

The VWF finds the linear estimate of s = Hx that minimizes the mean-squared
error $E(\|\hat{s} - s\|^2)$. Consequently, the mean squared error in the
VWF is no greater than that in the BLUE.


Ex. 20.3 Assume that $E(xx^\dagger) = \sigma^2 I$. Show that the mean squared error
for the VWF estimate is

$$E(\|\hat{s} - s\|^2) = \mathrm{trace}\,(H(H^\dagger Q^{-1} H + \sigma^{-2} I)^{-1} H^\dagger).$$


20.7 Wiener Filtering of Functions 


The Wiener filter is often presented in the context of random functions
of, say, time. In this model the signal is s(t) and the noise is q(t), where these
functions of time are viewed as random functions (stochastic processes).
The data is taken to be z(t), a function of t, so that the matrices $UU^\dagger$
and $WW^\dagger$ are now infinite matrices; the discrete index j = 1, ..., J is now
replaced by the continuous index variable t. Instead of the finite family
$\{u^n, n = 1, \ldots, N\}$, we now have an infinite family of functions u(t) in $\mathcal{U}$. The
entries of $UU^\dagger$ are essentially the average values of the products $u(t_1)\overline{u(t_2)}$
over all the members of $\mathcal{U}$. It is often assumed that this average of products
is a function not of $t_1$ and $t_2$ separately, but only of their difference $t_1 - t_2$;
this is called stationarity. So, $\mathrm{aver}\{u(t_1)\overline{u(t_2)}\} = r_s(t_1 - t_2)$ comes from a
function $r_s(\tau)$ of a single variable. The Fourier transform of $r_s(\tau)$ is $R_s(\omega)$,
the signal power spectrum. The matrix $UU^\dagger$ is then an infinite Toeplitz
matrix, constant on each diagonal. The Wiener filtering can actually be
achieved by taking Fourier transforms and multiplying and dividing by
power spectra, instead of inverting infinite matrices. It is also common to
discretize the time variable and to consider the Wiener filter operating on
infinite sequences, as we see in the next section.


20.8 Wiener Filter Approximation: The Discrete 
Stationary Case 
Suppose now that the discrete stationary random process to be filtered
is the doubly infinite sequence $\{z_n = s_n + q_n\}_{n=-\infty}^{\infty}$, where $\{s_n\}$ is the signal
component with autocorrelation function $r_s(k) = E(s_{n+k}\overline{s_n})$ and power
spectrum $R_s(\omega)$ defined for $\omega$ in the interval $[-\pi, \pi]$, and $\{q_n\}$ is the noise
component with autocorrelation function $r_q(k)$ and power spectrum $R_q(\omega)$
defined for $\omega$ in $[-\pi, \pi]$. We assume that for each n the random variables
$s_n$ and $q_n$ have mean zero and that the signal and noise are independent
of one another. Then the autocorrelation function for the signal-plus-noise
sequence $\{z_n\}$ is

$$r_z(n) = r_s(n) + r_q(n)$$

for all n and

$$R_z(\omega) = R_s(\omega) + R_q(\omega)$$

is the signal-plus-noise power spectrum.

Let $h = \{h_k\}$ be a linear filter with transfer function

$$H(\omega) = \sum_{k=-\infty}^{\infty} h_k e^{ik\omega},$$

for $\omega$ in $[-\pi, \pi]$. Given the sequence $\{z_n\}$ as input to this filter, the output
is the sequence

$$y_n = \sum_{k=-\infty}^{\infty} h_k z_{n-k}. \tag{20.2}$$

The goal of Wiener filtering is to select the filter h so that the output
sequence $y_n$ approximates the signal sequence $s_n$ as well as possible.
Specifically, we seek h so as to minimize the expected squared error, $E(|y_n - s_n|^2)$,
which, because of stationarity, is independent of n. We have


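$$E(|y_n|^2) = \sum_{k=-\infty}^{\infty} h_k \Big( \sum_{j=-\infty}^{\infty} \overline{h_j}\,\big(r_s(j - k) + r_q(j - k)\big) \Big) = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} h_k \overline{h_j}\, r_z(j - k),$$

which, by the Parseval Equation (2.17), equals

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} H(\omega) R_z(\omega) \overline{H(\omega)}\,d\omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H(\omega)|^2 R_z(\omega)\,d\omega.$$

Similarly,

$$E(s_n \overline{y_n}) = \sum_{j=-\infty}^{\infty} \overline{h_j}\, r_s(j),$$

which equals

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} R_s(\omega) \overline{H(\omega)}\,d\omega,$$

and

$$E(|s_n|^2) = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_s(\omega)\,d\omega.$$

Therefore,

$$E(|y_n - s_n|^2) = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H(\omega)|^2 R_z(\omega)\,d\omega - \frac{1}{2\pi}\int_{-\pi}^{\pi} R_s(\omega) H(\omega)\,d\omega - \frac{1}{2\pi}\int_{-\pi}^{\pi} R_s(\omega) \overline{H(\omega)}\,d\omega + \frac{1}{2\pi}\int_{-\pi}^{\pi} R_s(\omega)\,d\omega.$$

As we shall see shortly, minimizing $E(|y_n - s_n|^2)$ with respect to the function
$H(\omega)$ leads to the equation

$$R_z(\omega) H(\omega) = R_s(\omega),$$

so that the transfer function of the optimal filter is

$$H(\omega) = R_s(\omega)/R_z(\omega).$$

The Wiener filter is then the sequence $\{h_k\}$ of the Fourier coefficients of
this function $H(\omega)$.

To prove that this choice of $H(\omega)$ minimizes $E(|y_n - s_n|^2)$, we note that

$$|H(\omega)|^2 R_z(\omega) - R_s(\omega) H(\omega) - R_s(\omega) \overline{H(\omega)} + R_s(\omega) = R_z(\omega)\,|H(\omega) - R_s(\omega)/R_z(\omega)|^2 + R_s(\omega) - R_s(\omega)^2/R_z(\omega).$$

Only the first term involves the function $H(\omega)$; it is nonnegative and
vanishes when $H(\omega) = R_s(\omega)/R_z(\omega)$.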
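For a concrete feel for the optimal transfer function, the sketch below picks two power spectra purely for illustration, $R_s(\omega) = 1 + \cos\omega$ for the signal and a flat $R_q(\omega) = 0.5$ for white noise, forms $H(\omega) = R_s(\omega)/R_z(\omega)$, and approximates a few Fourier coefficients $h_k$ by a Riemann sum over $[-\pi, \pi]$. The function names and the grid size are assumptions of the sketch.

```python
import cmath
import math

def R_s(w):
    """Illustrative signal power spectrum (nonnegative on [-pi, pi])."""
    return 1.0 + math.cos(w)

def R_q(w):
    """Illustrative white-noise power spectrum."""
    return 0.5

def H(w):
    """Optimal transfer function H = R_s / R_z, with R_z = R_s + R_q."""
    return R_s(w) / (R_s(w) + R_q(w))

def pointwise_error(Hval, w):
    """Integrand |H|^2 R_z - R_s H - R_s conj(H) + R_s at one frequency."""
    R_z = R_s(w) + R_q(w)
    return abs(Hval) ** 2 * R_z - 2.0 * Hval.real * R_s(w) + R_s(w)

def fourier_coefficient(func, k, n_grid=2048):
    """h_k = (1/2pi) * integral of func(w) exp(-i k w) dw, by a Riemann sum."""
    total = 0j
    for j in range(n_grid):
        w = -math.pi + 2.0 * math.pi * j / n_grid
        total += func(w) * cmath.exp(-1j * k * w)
    return total / n_grid

h = {k: fourier_coefficient(H, k) for k in range(-3, 4)}
```

Because this $H(\omega)$ is real and even, the computed $h_k$ are (numerically) real with $h_{-k} = h_k$; any perturbation of $H(\omega)$ at a fixed frequency increases the integrand returned by `pointwise_error`.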


20.9 Approximating the Wiener Filter 


Since $H(\omega)$ is a nonnegative function of $\omega$, therefore real-valued, its
Fourier coefficients $h_k$ will be conjugate symmetric; that is, $h_{-k} = \overline{h_k}$. This
poses a problem when the random process $z_n$ is a discrete time series, with
$z_n$ denoting the measurement recorded at time n. From Equation (20.2)
we see that to produce the output $y_n$ corresponding to time n we need the
input for every time, past and future. To remedy this we can obtain the
best causal approximation of the Wiener filter h.

A filter $g = \{g_k\}_{k=-\infty}^{\infty}$ is said to be causal if $g_k = 0$ for k < 0; this
means that given the input sequence $\{z_n\}$, the output

$$w_n = \sum_{k=0}^{\infty} g_k z_{n-k}$$

requires only values of $z_m$ up to m = n. To obtain the causal filter g
that best approximates the Wiener filter, we find the coefficients $g_k$ that
minimize the quantity $E(|y_n - w_n|^2)$, or, equivalently, we minimize

$$\int_{-\pi}^{\pi} \Big| H(\omega) - \sum_{k=0}^{+\infty} g_k e^{ik\omega} \Big|^2 R_z(\omega)\,d\omega.$$

The orthogonality principle tells us that the optimal coefficients must
satisfy the equations

$$r_s(m) = \sum_{k=0}^{+\infty} g_k r_z(m - k),$$

for all $m \ge 0$. These are the Wiener-Hopf equations [122].

Even having a causal filter does not completely solve the problem, since
we would have to record and store the infinite past. Instead, we can decide
to use a filter $f = \{f_k\}$ for which $f_k = 0$ unless $-K \le k \le L$ for
some positive integers K and L. This means we must store L values and
wait until time n + K to obtain the output for time n. Such a linear filter
is a finite memory, finite delay filter, also called a finite impulse response
(FIR) filter. Given the input sequence $\{z_n\}$ the output of the FIR filter is

$$v_n = \sum_{k=-K}^{L} f_k z_{n-k}.$$

To obtain such an FIR filter f that best approximates the Wiener filter,
we find the coefficients $f_k$ that minimize the quantity $E(|y_n - v_n|^2)$, or,
equivalently, we minimize

$$\int_{-\pi}^{\pi} \Big| H(\omega) - \sum_{k=-K}^{L} f_k e^{ik\omega} \Big|^2 R_z(\omega)\,d\omega. \tag{20.3}$$

The orthogonality principle tells us that the optimal coefficients must
satisfy the equations

$$r_s(m) = \sum_{k=-K}^{L} f_k r_z(m - k), \tag{20.4}$$

for $-K \le m \le L$.

In [31] it was pointed out that the linear equations that arise in Wiener- 
filter approximation also occur in image reconstruction from projections, 
with the image to be reconstructed playing the role of the power spectrum 
to be approximated. The methods of Wiener-filter approximation were then 
used to derive linear and nonlinear image-reconstruction procedures. 
20.10 Adaptive Wiener Filters


Once again, we consider a stationary random process $z_n = s_n + v_n$ with
autocorrelation function $E(z_n \overline{z_{n-m}}) = r_z(m) = r_s(m) + r_v(m)$. The finite
causal Wiener filter (FCWF) $f = (f_0, f_1, \ldots, f_L)^T$ is convolved with $\{z_n\}$ to
produce an estimate of $s_n$ given by

$$\hat{s}_n = \sum_{k=0}^{L} f_k z_{n-k}.$$

With $y_n^\dagger = (z_n, z_{n-1}, \ldots, z_{n-L})$ we can write $\hat{s}_n = y_n^\dagger f$. The FCWF f
minimizes the expected squared error

$$J(f) = E(|s_n - \hat{s}_n|^2)$$

and is obtained as the solution of the equations

$$r_s(m) = \sum_{k=0}^{L} f_k r_z(m - k),$$

for $0 \le m \le L$. Therefore, to use the FCWF we need the values $r_s(m)$ and
$r_z(m - k)$ for m and k in the set {0, 1, ..., L}. When these autocorrelation
values are not known, we can use adaptive methods to approximate the
FCWF.


20.10.1 An Adaptive Least-Mean-Square Approach 


We assume now that we have $z_0, z_1, \ldots, z_N$ and $p_0, p_1, \ldots, p_N$, where $p_n$
is a prior estimate of $s_n$, but that we do not know the correlation functions
$r_z$ and $r_s$.

The gradient of the function J(f) is

$$\nabla J(f) = R_{zz} f - r_s,$$

where $R_{zz}$ is the square matrix with entries $r_z(m - n)$ and $r_s$ is the vector
with entries $r_s(m)$. An iterative gradient descent method for solving the
system of equations $R_{zz} f = r_s$ is

$$f_\tau = f_{\tau-1} - \mu_\tau \nabla J(f_{\tau-1}),$$

for some step-size parameters $\mu_\tau > 0$.

The adaptive least-mean-square (LMS) approach [48] replaces the gradient
of J(f) with an approximation of the gradient of the function
$G(f) = |s_n - \hat{s}_n|^2$, which is $-2(s_n - \hat{s}_n)y_n$. Since we do not know $s_n$,
we replace that term with the estimate $p_n$. The iterative step of the LMS
method is

$$f_\tau = f_{\tau-1} + \mu_\tau (p_\tau - y_\tau^\dagger f_{\tau-1})\,y_\tau, \tag{20.5}$$

for $L \le \tau \le N$. Notice that it is the approximate gradient of the function
$|s_\tau - \hat{s}_\tau|^2$ that is used at this step, in order to involve all the data $z_0, \ldots, z_N$
as we iterate from $\tau = L$ to $\tau = N$. We illustrate the use of this method in
adaptive interference cancellation.
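The LMS step in Equation (20.5) is short enough to write out directly. The sketch below assumes real-valued data, so the conjugates drop out, and a constant step size; the function name `lms`, the synthetic input, and the choice of prior estimates $p_n$ (here generated exactly by a known three-tap filter, so that the target of the iteration is known) are all illustrative.

```python
import random

def lms(z, p, L, mu):
    """Run f_tau = f_{tau-1} + mu*(p_tau - y_tau . f_{tau-1})*y_tau for tau = L..N,
    with y_tau = (z_tau, z_{tau-1}, ..., z_{tau-L}), starting from f = 0."""
    f = [0.0] * (L + 1)
    for tau in range(L, len(z)):
        y = [z[tau - k] for k in range(L + 1)]
        err = p[tau] - sum(fk * yk for fk, yk in zip(f, y))
        f = [fk + mu * err * yk for fk, yk in zip(f, y)]
    return f

random.seed(0)
true_f = [0.5, -0.3, 0.2]                      # hypothetical filter to be recovered
z = [random.uniform(-1.0, 1.0) for _ in range(3000)]
# Synthetic prior estimates: p_n is the output of true_f applied to z.
p = [sum(true_f[k] * z[n - k] for k in range(3)) if n >= 2 else 0.0
     for n in range(len(z))]

f_hat = lms(z, p, L=2, mu=0.5)
```

In this idealized, noise-free setting the iterates approach `true_f`. With noisy $p_n$ the iterates would instead fluctuate around the FCWF, with smaller step sizes giving smaller fluctuations at the cost of slower adaptation.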


20.10.2 Adaptive Interference Cancellation (AIC) 


Adaptive interference cancellation (AIC) [161] is used to suppress a 
dominant noise component Vn in the discrete sequence Zn = Sn + Un. It is 
assumed that we have available a good estimate qn of vn. The main idea 
is to switch the roles of signal and noise in the adaptive LMS method and 
design a filter to estimate vn. Once we have that estimate, we subtract it 
from Zn to get our estimate of sn. 

In the role of $z_n$ we use

$$q_n = v_n + \epsilon_n,$$

where $\epsilon_n$ denotes a low-level error component. In the role of $p_n$, we take
$z_n$, which is approximately $v_n$, since the signal $s_n$ is much lower than the
noise $v_n$. Then $y_n^T = (q_n, q_{n-1}, \ldots, q_{n-L})$. The iterative step used to find
the filter $f$ is then

$$f_\tau = f_{\tau-1} + \mu_\tau (z_\tau - y_\tau^\dagger f_{\tau-1}) y_\tau,$$

for $L \le \tau \le N$. When the iterative process has converged to $f$, we take as
our estimate of $s_n$

$$\hat{s}_n = z_n - \sum_{k=0}^{L} f_k q_{n-k}.$$


It has been suggested that this procedure be used in computerized tomog- 
raphy to correct artifacts due to patient motion [66]. 
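The following Python sketch shows one way the AIC recipe could be tried on synthetic data. The signal shapes, noise levels, filter length `L`, and constant step size `mu` are assumptions chosen only for illustration.

```python
import math
import random

random.seed(1)
n_samples, L, mu = 4000, 1, 0.01                      # hypothetical sizes and step
v = [random.gauss(0.0, 1.0) for _ in range(n_samples)]           # dominant interference v_n
q = [vn + random.gauss(0.0, 0.05) for vn in v]                   # q_n = v_n + eps_n
s = [0.1 * math.sin(0.05 * n) for n in range(n_samples)]         # weak signal of interest
z = [sn + vn for sn, vn in zip(s, v)]                            # observed z_n = s_n + v_n

f = [0.0] * (L + 1)
for t in range(L, n_samples):
    y = [q[t - k] for k in range(L + 1)]              # y_t is built from q, not from z
    err = z[t] - sum(yk * fk for yk, fk in zip(y, f)) # z_t plays the role of p_t
    f = [fk + mu * err * yk for fk, yk in zip(f, y)]

# Subtract the filtered reference from z to estimate s.
s_hat = [z[n] - sum(f[k] * q[n - k] for k in range(L + 1))
         for n in range(L, n_samples)]
count = n_samples - L
mse_before = sum(v[n] ** 2 for n in range(L, n_samples)) / count
mse_after = sum((s_hat[n - L] - s[n]) ** 2 for n in range(L, n_samples)) / count
```

Here `mse_before` measures the interference present in $z_n$ and `mse_after` measures what remains after cancellation; in this setup the second is far smaller than the first.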


20.10.3 Recursive Least Squares (RLS) 


An alternative to the LMS method is to find the least-squares solution
of the system of $N - L + 1$ linear equations

$$p_n = \sum_{k=0}^{L} f_k z_{n-k},$$

for $L \le n \le N$. The recursive least squares (RLS) method is a recursive
approach to solving this system.

For $L \le \tau \le N$ let $Z_\tau$ be the matrix whose rows are $y_n^T$, for $n = L, \ldots, \tau$,
let $p_\tau^T = (p_L, p_{L+1}, \ldots, p_\tau)$, and let $Q_\tau = Z_\tau^\dagger Z_\tau$. The least-squares solution we seek
is

$$f = Q_N^{-1} Z_N^\dagger p_N.$$


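Ex. 20.4 Show that $Q_\tau = Q_{\tau-1} + y_\tau y_\tau^\dagger$, for $L < \tau \le N$.


Ex. 20.5 Use the matrix-inversion identity in Equation (20.1) to write
$Q_\tau^{-1}$ in terms of $Q_{\tau-1}^{-1}$.


Ex. 20.6 Using the previous exercise, show that the desired least-squares
solution $f$ is $f = f_N$, where, for $L < \tau \le N$, we let

$$f_\tau = f_{\tau-1} + \left(\frac{p_\tau - y_\tau^\dagger f_{\tau-1}}{1 + y_\tau^\dagger Q_{\tau-1}^{-1} y_\tau}\right) Q_{\tau-1}^{-1} y_\tau.$$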


Comparing this iterative step with that given by Equation (20.5), we see
that the former gives an explicit value for $\mu_\tau$ and uses $Q_{\tau-1}^{-1} y_\tau$ instead of $y_\tau$
as the direction vector for the iterative step. The RLS iteration produces
a more accurate estimate of the FCWF than does the LMS method, but
requires more computation.
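A minimal Python sketch of the RLS recursion follows, for real-valued data. Because $Q_\tau$ is not invertible until enough data vectors have been seen, the sketch starts from the regularized guess $Q^{-1} \approx I/\delta$; that initialization, the value of `delta`, and the synthetic data are assumptions of this example and are not part of the text.

```python
import random

def rls_filter(z, p, L, delta=1e-3):
    """Recursive least squares using a rank-one inverse update.

    P tracks an approximation of Q^{-1}; it starts at I/delta (an assumption).
    """
    n = L + 1
    P = [[(1.0 / delta if i == j else 0.0) for j in range(n)] for i in range(n)]
    f = [0.0] * n
    for t in range(L, len(z)):
        y = [z[t - k] for k in range(n)]
        Py = [sum(P[i][j] * y[j] for j in range(n)) for i in range(n)]  # Q^{-1} y
        denom = 1.0 + sum(y[i] * Py[i] for i in range(n))               # 1 + y^T Q^{-1} y
        err = p[t] - sum(y[i] * f[i] for i in range(n))                 # p_t - y^T f
        f = [f[i] + (err / denom) * Py[i] for i in range(n)]
        # Matrix-inversion identity: (Q + y y^T)^{-1} = P - (P y)(P y)^T / denom.
        P = [[P[i][j] - Py[i] * Py[j] / denom for j in range(n)] for i in range(n)]
    return f

random.seed(2)
true_f = [0.5, -0.3, 0.2]                     # hypothetical filter to recover
z = [random.gauss(0.0, 1.0) for _ in range(200)]
p = [sum(true_f[k] * z[n - k] for k in range(3)) if n >= 2 else 0.0
     for n in range(len(z))]
f_rls = rls_filter(z, p, L=2)
```

Note that each step costs on the order of $(L+1)^2$ operations for the update of `P`, compared with order $L+1$ for an LMS step.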


21.1 Chapter Summary


Matrices and their algebraic properties play an ever-increasing role in 
signal processing. In this chapter we outline the most important of these 
properties. The notation associated with matrix and vector algebra is de- 
signed to reduce the number of things we have to think about as we perform 
our calculations. This notation can be extended to multi-variable calculus,
as we also show in this chapter.




21.2 Matrix Inverses 


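A square matrix $A$ is said to have inverse $A^{-1}$ provided that

$$A A^{-1} = A^{-1} A = I,$$

where $I$ is the identity matrix. The 2 by 2 matrix $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ has an
inverse

$$A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

whenever the determinant of $A$, $\det(A) = ad - bc$, is not zero. More generally,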
associated with every complex square matrix is the complex number called 
its determinant, which is obtained from the entries of the matrix using 
formulas that can be found in any text on linear algebra. The significance of 
the determinant is that the matrix is invertible if and only if its determinant 
is not zero. This is of more theoretical than practical importance, since no 
computer can tell when a number is precisely zero. A matrix A that is not 
square cannot have an inverse, but does have a pseudo-inverse, which is 
found using the singular-value decomposition. 
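The 2 by 2 formula, together with the remark about testing for zero, can be sketched in Python as follows. The relative tolerance `tol` is an assumption; it stands in for the fact that floating-point arithmetic cannot reliably decide whether a determinant is exactly zero.

```python
def inverse_2x2(a, b, c, d, tol=1e-12):
    """Invert [[a, b], [c, d]] with the closed-form 2 by 2 formula.

    Raises ValueError when the determinant is zero to within a relative tolerance.
    """
    det = a * d - b * c
    if abs(det) <= tol * max(abs(a * d), abs(b * c), 1.0):
        raise ValueError("matrix is singular to working precision")
    return [[d / det, -b / det], [-c / det, a / det]]

A_inv = inverse_2x2(2.0, 1.0, 5.0, 3.0)       # determinant is 1

try:
    inverse_2x2(1.0, 2.0, 2.0, 4.0)           # second row is twice the first
    singular_detected = False
except ValueError:
    singular_detected = True
```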




21.3 Basic Linear Algebra 


In this section we discuss systems of linear equations, Gaussian elimi- 
nation, and the notions of basic and non-basic variables. 


21.3.1 Bases and Dimension 


The notions of a basis and of linear independence are fundamental in 
linear algebra. Let V be a vector space. 


Definition 21.1 A collection of vectors $\{u^1, \ldots, u^N\}$ in $V$ is linearly
independent if there is no choice of scalars $\alpha_1, \ldots, \alpha_N$, not all zero, such that

$$0 = \alpha_1 u^1 + \ldots + \alpha_N u^N.$$

Definition 21.2 The span of a collection of vectors $\{u^1, \ldots, u^N\}$ in $V$ is
the set of all vectors $x$ that can be written as linear combinations of the $u^n$;
that is, for which there are scalars $c_1, \ldots, c_N$, such that

$$x = c_1 u^1 + \ldots + c_N u^N.$$

Definition 21.3 A collection of vectors $\{w^1, \ldots, w^N\}$ in $V$ is called a
spanning set for a subspace $S$ if the set $S$ is their span.


Definition 21.4 A collection of vectors $\{u^1, \ldots, u^N\}$ in $V$ is called a basis
for a subspace $S$ if the collection is linearly independent and $S$ is their span.


Definition 21.5 A collection of vectors $\{u^1, \ldots, u^N\}$ in an inner product
space $V$ is called orthonormal if $\|u^n\|_2 = 1$, for all $n$, and $\langle u^m, u^n \rangle = 0$,
for $m \ne n$.


Suppose that $S$ is a subspace of $V$, that $\{w^1, \ldots, w^N\}$ is a spanning set
for $S$, and $\{u^1, \ldots, u^M\}$ is a linearly independent subset of $S$. Beginning with
$w^1$, we augment the set $\{u^1, \ldots, u^M\}$ with $w^j$ if $w^j$ is not in the span of the
$u^m$ and the $w^k$ previously included. At the end of this process, we have
a linearly independent spanning set, and therefore, a basis, for $S$ (Why?).
Similarly, beginning with $w^1$, we remove $w^j$ from the set $\{w^1, \ldots, w^N\}$ if $w^j$
is a linear combination of the $w^k$, $k = 1, \ldots, j-1$. In this way we obtain
a linearly independent set that spans $S$, hence another basis for $S$. The
following lemma will allow us to prove that all bases for a subspace $S$ have
the same number of elements.


Lemma 21.1 Let $W = \{w^1, \ldots, w^N\}$ be a spanning set for a subspace $S$ in
$R^I$, and $V = \{v^1, \ldots, v^M\}$ a linearly independent subset of $S$. Then $M \le N$.


Proof: Suppose that $M > N$. Let $B_0 = \{w^1, \ldots, w^N\}$. To obtain the set
$B_1$, form the set $C_1 = \{v^1, w^1, \ldots, w^N\}$ and remove the first member of $C_1$
that is a linear combination of members of $C_1$ that occur to its left in the
listing; since $v^1$ has no members to its left, it is not removed. Since $W$ is
a spanning set, $v^1$ is a linear combination of the members of $W$, so that
some member of $W$ is a linear combination of $v^1$ and the members of $W$
that precede it in the list; remove the first member of $W$ for which this is
true.

We note that the set $B_1$ is a spanning set for $S$ and has $N$ members.
Having obtained the spanning set $B_k$, with $N$ members and whose first $k$
members are $v^k, \ldots, v^1$, we form the set $C_{k+1} = B_k \cup \{v^{k+1}\}$, listing the
members so that the first $k+1$ of them are $\{v^{k+1}, v^k, \ldots, v^1\}$. To get the set
$B_{k+1}$ we remove the first member of $C_{k+1}$ that is a linear combination of
the members to its left; there must be one, since $B_k$ is a spanning set, and
so $v^{k+1}$ is a linear combination of the members of $B_k$. Since the set $V$ is
linearly independent, the member removed is from the set $W$. Continuing
in this fashion, we obtain a sequence of spanning sets $B_1, \ldots, B_N$, each with
$N$ members. The set $B_N$ is $B_N = \{v^1, \ldots, v^N\}$ and $v^{N+1}$ must then be
a linear combination of the members of $B_N$, which contradicts the linear
independence of $V$. ∎

Corollary 21.1 Every basis for a subspace $S$ has the same number of
elements.


Ex. 21.1 Let $W = \{w^1, \ldots, w^N\}$ be a spanning set for a subspace $S$ in
$R^I$, and $V = \{v^1, \ldots, v^M\}$ a linearly independent subset of $S$. Let $A$ be the
matrix whose columns are the $v^m$, $B$ the matrix whose columns are the $w^n$.
Show that there is an $N$ by $M$ matrix $C$ such that $A = BC$. Prove Lemma
21.1 by showing that, if $M > N$, then there is a non-zero vector $x$ with
$Cx = Ax = 0$.


Definition 21.6 The dimension of a subspace S is the number of elements 
in any basis. 


Lemma 21.2 For any matrix $A$, the maximum number of linearly
independent rows equals the maximum number of linearly independent columns.


Proof: Suppose that $A$ is an $I$ by $J$ matrix, and that $K \le J$ is the
maximum number of linearly independent columns of $A$. Select $K$ linearly
independent columns of $A$ and use them as the $K$ columns of an $I$ by $K$
matrix $U$. Since every column of $A$ must be a linear combination of these
$K$ selected ones, there is a $K$ by $J$ matrix $M$ such that $A = UM$. From
$A^T = M^T U^T$ we conclude that every column of $A^T$ is a linear combination
of the $K$ columns of the matrix $M^T$. Therefore, there can be at most $K$
linearly independent columns of $A^T$, that is, at most $K$ linearly independent
rows of $A$. Applying the same argument to $A^T$ gives the reverse inequality. ∎


Definition 21.7 The rank of $A$ is the maximum number of linearly
independent rows or of linearly independent columns of $A$.


21.3.2 Systems of Linear Equations 


Consider the system of three linear equations in five unknowns given by

$$x_1 + 2x_2 + 2x_4 + x_5 = 0$$

$$-x_1 - x_2 + x_3 + x_4 = 0$$

$$x_1 + 2x_2 - 3x_3 - x_4 - 2x_5 = 0.$$

This system can be written in matrix form as $Ax = 0$, with $A$ the coefficient
matrix

$$A = \begin{bmatrix} 1 & 2 & 0 & 2 & 1 \\ -1 & -1 & 1 & 1 & 0 \\ 1 & 2 & -3 & -1 & -2 \end{bmatrix},$$

and $x = (x_1, x_2, x_3, x_4, x_5)^T$. Applying Gaussian elimination to this system,
we obtain a second, simpler, system with the same solutions:

$$x_1 - 2x_4 + x_5 = 0$$

$$x_2 + 2x_4 = 0$$

$$x_3 + x_4 + x_5 = 0.$$


From this simpler system we see that the variables $x_4$ and $x_5$ can be freely
chosen, with the other three variables then determined by this system of
equations. The variables $x_4$ and $x_5$ are then independent, the others
dependent. The variables $x_1$, $x_2$ and $x_3$ are then called basic variables, and
$x_4$ and $x_5$ non-basic variables. To obtain a basis of solutions we can let
$x_4 = 1$ and $x_5 = 0$, obtaining the solution $x = (2, -2, -1, 1, 0)^T$, and then
choose $x_4 = 0$ and $x_5 = 1$ to get the solution $x = (-1, 0, -1, 0, 1)^T$. Every
solution to $Ax = 0$ is then a linear combination of these two solutions.
Notice that which variables are basic and which are non-basic is somewhat
arbitrary, in that we could have chosen as the non-basic variables any two
for which the columns of $A$ belonging to the remaining three variables are
linearly independent.

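Having decided that $x_4$ and $x_5$ are the non-basic variables, we can write
the original matrix $A$ as $A = [\, B \;\; N \,]$, where $B$ is the square invertible
matrix

$$B = \begin{bmatrix} 1 & 2 & 0 \\ -1 & -1 & 1 \\ 1 & 2 & -3 \end{bmatrix},$$

and $N$ is the matrix

$$N = \begin{bmatrix} 2 & 1 \\ 1 & 0 \\ -1 & -2 \end{bmatrix}.$$

With $x_B = (x_1, x_2, x_3)^T$ and $x_N = (x_4, x_5)^T$ we can write

$$Ax = B x_B + N x_N = 0,$$

so that

$$x_B = -B^{-1} N x_N.$$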
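The elimination for this example can be checked with a short Python sketch that uses exact rational arithmetic. The helper names `rref` and `null_space_basis` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def rref(rows):
    """Return the reduced row-echelon form and the list of pivot columns."""
    M = [[Fraction(v) for v in row] for row in rows]
    pivots = []
    r = 0
    for c in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue                           # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        M[r] = [v / M[r][c] for v in M[r]]     # scale the pivot row
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [vi - factor * vr for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == len(M):
            break
    return M, pivots

def null_space_basis(rows):
    """Set one non-basic variable to 1, the rest to 0, and solve for the basic ones."""
    R, pivots = rref(rows)
    n = len(rows[0])
    free = [c for c in range(n) if c not in pivots]
    basis = []
    for fcol in free:
        x = [Fraction(0)] * n
        x[fcol] = Fraction(1)
        for r, pc in enumerate(pivots):
            x[pc] = -R[r][fcol]
        basis.append(x)
    return basis

A = [[1, 2, 0, 2, 1], [-1, -1, 1, 1, 0], [1, 2, -3, -1, -2]]
R, pivots = rref(A)
basis = null_space_basis(A)
```

The last two columns of `R` hold $B^{-1}N$, so reading off their negatives is the same computation as $x_B = -B^{-1} N x_N$.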


21.3.3 Real and Complex Systems of Linear Equations 


A system $Ax = b$ of linear equations is called a complex system, or a
real system, if the entries of $A$, $x$ and $b$ are complex, or real, respectively.
For any matrix $A$, we denote by $A^T$ and $A^\dagger$ the transpose and conjugate
transpose of $A$, respectively.

Any complex system can be converted to a real system in the following
way. A complex matrix $A$ can be written as $A = A_1 + iA_2$, where $A_1$ and
$A_2$ are real matrices and $i = \sqrt{-1}$. Similarly, $x = x^1 + ix^2$ and $b = b^1 + ib^2$,
where $x^1$, $x^2$, $b^1$ and $b^2$ are real vectors. Denote by $\tilde{A}$ the real matrix

$$\tilde{A} = \begin{bmatrix} A_1 & -A_2 \\ A_2 & A_1 \end{bmatrix},$$

by $\tilde{x}$ the real vector

$$\tilde{x} = \begin{bmatrix} x^1 \\ x^2 \end{bmatrix},$$

and by $\tilde{b}$ the real vector

$$\tilde{b} = \begin{bmatrix} b^1 \\ b^2 \end{bmatrix}.$$

Then $x$ satisfies the system $Ax = b$ if and only if $\tilde{x}$ satisfies the system
$\tilde{A}\tilde{x} = \tilde{b}$.
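The conversion of a complex system to a real one can be sketched in Python as follows. The helper names and the small test system are assumptions made for illustration; the block layout used is the standard one, with $A_1$ and $-A_2$ in the top row and $A_2$ and $A_1$ in the bottom row.

```python
def real_embedding(A):
    """Build the real block matrix [[A1, -A2], [A2, A1]] from complex A = A1 + i A2."""
    top = [[a.real for a in row] + [-a.imag for a in row] for row in A]
    bottom = [[a.imag for a in row] + [a.real for a in row] for row in A]
    return top + bottom

def stack(v):
    """Stack real parts on top of imaginary parts."""
    return [c.real for c in v] + [c.imag for c in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]      # hypothetical complex matrix
x = [1 - 1j, 2 + 3j]                          # hypothetical complex vector
b = matvec(A, x)                              # complex right-hand side

lhs = matvec(real_embedding(A), stack(x))     # real matrix times real vector
rhs = stack(b)                                # stacked real and imaginary parts of b
```

The two lists `lhs` and `rhs` agree, which is the statement that the complex system and its real counterpart have corresponding solutions.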

Definition 21.8 A square matrix $A$ is symmetric if $A^T = A$ and
Hermitian if $A^\dagger = A$.

Definition 21.9 A non-zero vector $x$ is said to be an eigenvector of the
square matrix $A$ if there is a scalar $\lambda$ such that $Ax = \lambda x$. Then $\lambda$ is said
to be an eigenvalue of $A$.


If $x$ is an eigenvector of $A$ with eigenvalue $\lambda$, then the matrix $A - \lambda I$
has no inverse, so its determinant is zero; here $I$ is the identity matrix with
ones on the main diagonal and zeros elsewhere. Solving for the roots of the
determinant is one way to calculate the eigenvalues of $A$. For example, the
eigenvalues of the Hermitian matrix


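$$B = \begin{bmatrix} 1 & 2+i \\ 2-i & 1 \end{bmatrix}$$

are $\lambda = 1 + \sqrt{5}$ and $\lambda = 1 - \sqrt{5}$, with corresponding eigenvectors
$u = (\sqrt{5}, 2-i)^T$ and $v = (\sqrt{5}, i-2)^T$, respectively. Then the real matrix $\tilde{B}$ has the
same eigenvalues, but both with multiplicity two. Finally, writing $u = u^1 + iu^2$
and $v = v^1 + iv^2$ with $u^1$, $u^2$, $v^1$ and $v^2$ real, the associated
eigenvectors of $\tilde{B}$ are

$$\begin{bmatrix} u^1 \\ u^2 \end{bmatrix} = (\sqrt{5}, 2, 0, -1)^T$$

and

$$\begin{bmatrix} -u^2 \\ u^1 \end{bmatrix} = (0, 1, \sqrt{5}, 2)^T$$

for $\lambda = 1 + \sqrt{5}$, and

$$\begin{bmatrix} v^1 \\ v^2 \end{bmatrix} = (\sqrt{5}, -2, 0, 1)^T$$

and

$$\begin{bmatrix} -v^2 \\ v^1 \end{bmatrix} = (0, -1, \sqrt{5}, -2)^T$$

for $\lambda = 1 - \sqrt{5}$.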
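The eigenvalues and eigenvectors of this Hermitian example can be checked numerically. The sketch below assumes the matrix has rows $(1, 2+i)$ and $(2-i, 1)$ and that its real counterpart uses the block layout with the real part on the diagonal blocks; the helper names are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def real_embedding(A):
    """Real block matrix [[A1, -A2], [A2, A1]] for complex A = A1 + i A2."""
    top = [[a.real for a in row] + [-a.imag for a in row] for row in A]
    bottom = [[a.imag for a in row] + [a.real for a in row] for row in A]
    return top + bottom

B = [[1, 2 + 1j], [2 - 1j, 1]]

# Roots of det(B - lambda I) = lambda^2 - trace * lambda + det.
tr = (B[0][0] + B[1][1]).real
det = (B[0][0] * B[1][1] - B[0][1] * B[1][0]).real
lam_plus = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0
lam_minus = (tr - math.sqrt(tr * tr - 4.0 * det)) / 2.0

u = [math.sqrt(5.0), 2 - 1j]
residual_u = max(abs(a - lam_plus * c) for a, c in zip(matvec(B, u), u))

# Two real eigenvectors of the embedded matrix built from u = u1 + i u2.
B_tilde = real_embedding(B)
w1 = [c.real for c in u] + [c.imag for c in u]      # (u1, u2)
w2 = [-c.imag for c in u] + [c.real for c in u]     # (-u2, u1)
residual_w = max(
    abs(a - lam_plus * c)
    for w in (w1, w2)
    for a, c in zip(matvec(B_tilde, w), w)
)
```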


21.4 Solutions of Under-determined Systems of Linear
Equations


Suppose that $Ax = b$ is a consistent linear system of $M$ equations in
$N$ unknowns, where $M < N$. Then there are infinitely many solutions.
A standard procedure in such cases is to find that solution $x$ having the
smallest norm

$$\|x\|_2 = \sqrt{\sum_{n=1}^{N} |x_n|^2}.$$

As we shall see shortly, the minimum-norm solution of $Ax = b$ is a vector of
the form $x = A^\dagger z$, where $A^\dagger$ denotes the conjugate transpose of the matrix
$A$. Then $Ax = b$ becomes $AA^\dagger z = b$. Typically, $(AA^\dagger)^{-1}$ will exist, and we
get $z = (AA^\dagger)^{-1} b$, from which it follows that the minimum-norm solution
is $x = A^\dagger (AA^\dagger)^{-1} b$. When $M$ and $N$ are not too large, forming the matrix
$AA^\dagger$ and solving for $z$ is not prohibitively expensive and time-consuming.
However, in image processing the vector $x$ is often a vectorization of a
two-dimensional (or even three-dimensional) image and $M$ and $N$ can be on
the order of tens of thousands or more. The ART algorithm gives us a fast
method for finding the minimum-norm solution without computing $AA^\dagger$;
see [84] and [42].

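We begin by proving that the minimum-norm solution of $Ax = b$ has
the form $x = A^\dagger z$ for some $M$-dimensional complex vector $z$.

Let the null space of the matrix $A$ be all $N$-dimensional complex vectors
$w$ with $Aw = 0$. If $Ax = b$ then $A(x + w) = b$ for all $w$ in the null space
of $A$. If $x = A^\dagger z$ and $w$ is in the null space of $A$, then

$$\|x + w\|^2 = \|A^\dagger z + w\|^2 = (A^\dagger z + w)^\dagger (A^\dagger z + w)$$

$$= (A^\dagger z)^\dagger (A^\dagger z) + (A^\dagger z)^\dagger w + w^\dagger (A^\dagger z) + w^\dagger w$$

$$= \|A^\dagger z\|^2 + (A^\dagger z)^\dagger w + w^\dagger (A^\dagger z) + \|w\|^2$$

$$= \|A^\dagger z\|^2 + \|w\|^2,$$

since

$$w^\dagger (A^\dagger z) = (Aw)^\dagger z = 0^\dagger z = 0$$

and

$$(A^\dagger z)^\dagger w = z^\dagger A w = z^\dagger 0 = 0.$$

Therefore, $\|x + w\| = \|A^\dagger z + w\| > \|A^\dagger z\| = \|x\|$ unless $w = 0$.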
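A small worked instance of the minimum-norm formula $x = A^\dagger (AA^\dagger)^{-1} b$ is sketched below in Python with exact rational arithmetic. The 2 by 3 real matrix, the right-hand side, and the helper name `min_norm_solution` are assumptions chosen for illustration; the function handles only two equations so that $AA^T$ can be inverted by the 2 by 2 formula.

```python
from fractions import Fraction

def min_norm_solution(A, b):
    """Return x = A^T (A A^T)^{-1} b for a real 2-by-N matrix A with independent rows."""
    r0 = [Fraction(a) for a in A[0]]
    r1 = [Fraction(a) for a in A[1]]
    g00 = sum(a * a for a in r0)                 # entries of the 2 by 2 matrix A A^T
    g01 = sum(a * c for a, c in zip(r0, r1))
    g11 = sum(c * c for c in r1)
    det = g00 * g11 - g01 * g01
    z0 = (g11 * b[0] - g01 * b[1]) / det         # z = (A A^T)^{-1} b
    z1 = (g00 * b[1] - g01 * b[0]) / det
    return [z0 * a + z1 * c for a, c in zip(r0, r1)]   # x = A^T z

def norm_sq(v):
    return sum(c * c for c in v)

A = [[1, 2, 0], [0, 1, 1]]
b = [3, 2]
x = min_norm_solution(A, b)

w = [2, -1, 1]                                   # a null-space vector: A w = 0
other = [xi + wi for xi, wi in zip(x, w)]        # another solution of A x = b
```

Comparing `norm_sq(other)` with `norm_sq(x)` shows the identity from the argument above: the squared norm grows by exactly `norm_sq(w)`.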


Ex. 21.2 Show that if $z = (z_1, \ldots, z_N)^T$ is a column vector with complex
entries and $H = H^\dagger$ is an $N$ by $N$ Hermitian matrix with complex entries
then the quadratic form $z^\dagger H z$ is a real number. Show that the quadratic
form $z^\dagger H z$ can be calculated using only real numbers. Let $z = x + iy$, with
$x$ and $y$ real vectors and let $H = A + iB$, where $A$ and $B$ are real matrices.
Then show that $A^T = A$, $B^T = -B$, $x^T B x = 0$ and finally,

$$z^\dagger H z = \begin{bmatrix} x^T & y^T \end{bmatrix} \begin{bmatrix} A & -B \\ B & A \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}.$$

Use the fact that $z^\dagger H z$ is real for every vector $z$ to conclude that the
eigenvalues of $H$ are real.


21.5 Eigenvalues and Eigenvectors 


Given an $N$ by $N$ complex matrix $A$, we say that a complex number $\lambda$ is an
eigenvalue of $A$ if there is a nonzero vector $u$ with $Au = \lambda u$. The column
vector $u$ is then called an eigenvector of $A$ associated with eigenvalue $\lambda$;
clearly, if $u$ is an eigenvector of $A$, then so is $cu$, for any constant $c \ne 0$.
If $\lambda$ is an eigenvalue of $A$, then the matrix $A - \lambda I$ fails to have an inverse,
since $(A - \lambda I)u = 0$ but $u \ne 0$. If we treat $\lambda$ as a variable and compute
the determinant of $A - \lambda I$, we obtain a polynomial of degree $N$ in $\lambda$. Its
roots $\lambda_1, \ldots, \lambda_N$ are then the eigenvalues of $A$. If $\|u\|^2 = u^\dagger u = 1$ then
$u^\dagger A u = \lambda u^\dagger u = \lambda$.

It can be shown that it is possible to find a set of $N$ mutually orthogonal
eigenvectors of the Hermitian matrix $H$; call them $\{u^1, \ldots, u^N\}$. The matrix
$H$ can then be written as

$$H = \sum_{n=1}^{N} \lambda_n u^n (u^n)^\dagger,$$

a linear superposition of the dyad matrices $u^n (u^n)^\dagger$. We can also write $H =
ULU^\dagger$, where $U$ is the matrix whose $n$th column is the column vector $u^n$
and $L$ is the diagonal matrix with the eigenvalues down the main diagonal
and zero elsewhere.

The matrix $H$ is invertible if and only if none of the $\lambda_n$ are zero and its
inverse is

$$H^{-1} = \sum_{n=1}^{N} \lambda_n^{-1} u^n (u^n)^\dagger.$$

We also have $H^{-1} = U L^{-1} U^\dagger$.
A Hermitian matrix $Q$ is said to be nonnegative definite (positive definite)
if all the eigenvalues of $Q$ are nonnegative (positive). The matrix $Q$ is
a nonnegative-definite matrix if and only if there is another matrix $C$ such
that $Q = C^\dagger C$. Since the eigenvalues of $Q$ are nonnegative, the diagonal
matrix $L$ has a square root, $\sqrt{L}$. Using the fact that $U^\dagger U = I$, we have

$$Q = U L U^\dagger = U \sqrt{L} U^\dagger U \sqrt{L} U^\dagger;$$

we then take $C = U \sqrt{L} U^\dagger$, so $C^\dagger = C$. Then $z^\dagger Q z = z^\dagger C^\dagger C z = \|Cz\|^2$, so
that $Q$ is positive definite if and only if $C$ is invertible.


Ex. 21.3 Let $A$ be an $M$ by $N$ matrix with complex entries. View $A$ as a
linear function with domain $C^N$, the space of all $N$-dimensional complex
column vectors, and range contained within $C^M$, via the expression $A(x) =
Ax$. Suppose that $M > N$. The range of $A$, denoted $R(A)$, cannot be all of
$C^M$. Show that every vector $z$ in $C^M$ can be written uniquely in the form
$z = Ax + w$, where $A^\dagger w = 0$. Show that $\|z\|^2 = \|Ax\|^2 + \|w\|^2$, where $\|z\|^2$
denotes the square of the norm of $z$. Hint: If $z = Ax + w$ then consider
$A^\dagger z$. Assume $A^\dagger A$ is invertible.


21.6 Vectorization of a Matrix 


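When the complex $M$ by $N$ matrix $A$ is stored in the computer it is
usually vectorized; that is, the matrix

$$A = \begin{bmatrix} A_{11} & A_{12} & \ldots & A_{1N} \\ A_{21} & A_{22} & \ldots & A_{2N} \\ \vdots & \vdots & & \vdots \\ A_{M1} & A_{M2} & \ldots & A_{MN} \end{bmatrix}$$

becomes

$$\mathrm{vec}(A) = (A_{11}, A_{21}, \ldots, A_{M1}, A_{12}, A_{22}, \ldots, A_{M2}, \ldots, A_{MN})^T.$$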


Ex. 21.4 (a) Show that the complex dot product $\mathrm{vec}(A) \cdot \mathrm{vec}(B) =
\mathrm{vec}(B)^\dagger \mathrm{vec}(A)$ can be obtained by

$$\mathrm{vec}(A) \cdot \mathrm{vec}(B) = \mathrm{trace}(AB^\dagger) = \mathrm{tr}(AB^\dagger),$$

where, for a square matrix $C$, $\mathrm{trace}(C)$ means the sum of the entries along
the main diagonal of $C$. We can therefore use the trace to define an inner
product between matrices: $\langle A, B \rangle = \mathrm{trace}(AB^\dagger)$.

(b) Show that $\mathrm{trace}(AA^\dagger) \ge 0$ for all $A$, so that we can use the trace to
define a norm on matrices: $\|A\|^2 = \mathrm{trace}(AA^\dagger)$.
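The identity in part (a) is easy to test numerically. The sketch below stacks columns to form the vectorization and compares the complex dot product with the trace of $AB^\dagger$; the two small matrices and the helper names are hypothetical.

```python
def vec(A):
    """Stack the columns of A into one list (column-major order)."""
    return [A[m][n] for n in range(len(A[0])) for m in range(len(A))]

def conj_transpose(B):
    return [[B[m][n].conjugate() for m in range(len(B))] for n in range(len(B[0]))]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def trace(C):
    return sum(C[i][i] for i in range(len(C)))

def complex_dot(a, b):
    """vec(B)^dagger vec(A): the second argument is conjugated."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

A = [[1 + 1j, 2, 0], [3j, 1 - 2j, 4]]
B = [[2, 1j, 1], [1 - 1j, 0, 2j]]

dot_value = complex_dot(vec(A), vec(B))
trace_value = trace(matmul(A, conj_transpose(B)))
frobenius_sq = trace(matmul(A, conj_transpose(A))).real   # ||A||^2 = trace(A A^dagger)
```

The quantity `frobenius_sq` is the sum of the squared magnitudes of the entries of `A`, which is why it is never negative.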


Ex. 21.5 Let $B = ULV^\dagger$ be an $M$ by $N$ matrix in diagonalized form; that
is, $L$ is an $M$ by $N$ diagonal matrix with entries $\lambda_1, \ldots, \lambda_K$ on its main
diagonal, where $K = \min(M, N)$, and $U$ and $V$ are square matrices. Let
the $n$-th column of $U$ be denoted $u^n$ and similarly for the columns of $V$.
Such a diagonal decomposition occurs in the singular value decomposition
(SVD). Show that we can write

$$B = \lambda_1 u^1 (v^1)^\dagger + \ldots + \lambda_K u^K (v^K)^\dagger.$$

If $B$ is an $N$ by $N$ Hermitian matrix, then we can take $U = V$ and $K =
M = N$, with the columns of $U$ the eigenvectors of $B$, normalized to have
Euclidean norm equal to one, and the $\lambda_n$ to be the eigenvalues of $B$. In
this case we may also assume that $U$ is a unitary matrix; that is, $UU^\dagger =
U^\dagger U = I$, where $I$ denotes the identity matrix.


21.7 The Singular Value Decomposition of a Matrix 


We have just seen that an $N$ by $N$ Hermitian matrix $H$ can be written
in terms of its eigenvalues and eigenvectors as $H = ULU^\dagger$ or as

$$H = \sum_{n=1}^{N} \lambda_n u^n (u^n)^\dagger.$$


The singular value decomposition (SVD) is a similar result that applies to 
any rectangular matrix. It is an important tool in image compression and 
pseudo-inversion. 


21.7.1 The SVD 


Let $C$ be any $N$ by $K$ complex matrix. In presenting the SVD of $C$ we
shall assume that $K \ge N$; the SVD of $C^\dagger$ will come from that of $C$. Let
$A = C^\dagger C$ and $B = CC^\dagger$; we assume, reasonably, that $B$, the smaller of the
two matrices, is invertible, so all the eigenvalues $\lambda_1, \ldots, \lambda_N$ of $B$ are positive.
Then, write the eigenvalue/eigenvector decomposition of $B$ as $B = ULU^\dagger$.


Ex. 21.6 Show that the nonzero eigenvalues of $A$ and $B$ are the same.

Let $V$ be the $K$ by $K$ matrix whose first $N$ columns are those of the
matrix $C^\dagger U L^{-1/2}$ and whose remaining $K - N$ columns are any mutually
orthogonal norm-one vectors that are all orthogonal to each of the first
$N$ columns. Let $M$ be the $N$ by $K$ matrix with diagonal entries $M_{nn} =
\sqrt{\lambda_n}$ for $n = 1, \ldots, N$ and whose remaining entries are zero. The nonzero
entries of $M$, $\sqrt{\lambda_n}$, are called the singular values of $C$. The singular value
decomposition (SVD) of $C$ is $C = UMV^\dagger$. The SVD of $C^\dagger$ is $C^\dagger = VM^TU^\dagger$.


Ex. 21.7 Show that $UMV^\dagger = C$.


Using the SVD of $C$ we can write

$$C = \sum_{n=1}^{N} \sqrt{\lambda_n}\, u^n (v^n)^\dagger,$$

where $v^n$ denotes the $n$th column of the matrix $V$.

In image processing, matrices such as C are used to represent discrete 
two-dimensional images, with the entries of C corresponding to the grey 
level or color at each pixel. It is common to find that most of the N singular 
values of C are nearly zero, so that C can be written approximately as a 
sum of far fewer than N dyads; this is SVD image compression. 


21.7.2 An Application in Space Exploration 


The Galileo was deployed from the space shuttle Atlantis on October 18, 
1989. After a detour around Venus and back past Earth to pick up gravity- 
assisted speed, Galileo headed for Jupiter. Its mission included a study of 
Jupiter’s moon Europa, and the plan was to send back one high-resolution 
photo per minute, at a rate of 134 KB per second, via a huge high-gain 
antenna, one with a high degree of directionality that can transmit most of 
the limited signal energy in a narrow beam. When the time came to open 
the antenna, it stuck. Without the pictures, the mission would be a failure. 

There was a much smaller low-gain antenna on board, but the best 
transmission rate was going to be ten bits per second, and the directionality 
was much less. All that could be done from earth was to reprogram an old 
on-board computer to compress the pictures prior to transmission. The 
problem was that pictures could be taken much faster than they could be 
transmitted to earth; some way to store them prior to transmission was 
key. The original designers of the software had long since retired, but the 
engineers figured out a way to introduce state-of-the-art image compression 
algorithms into the computer. It happened that there was an ancient reel- 
to-reel storage device on board that was there only to serve as a backup for 
storing atmospheric data. Using this device and the compression methods, 
the engineers saved the mission [5]. 


21.7.3 Pseudo-Inversion


If $N \ne K$ then $C$ cannot have an inverse; it does, however, have a
pseudo-inverse, $C^\sharp = VM^\sharp U^\dagger$, where $M^\sharp$ is the $K$ by $N$ matrix obtained from $M^T$
by taking the inverse of each of its nonzero entries and leaving the remaining
zeros the same. The pseudo-inverse of $C^\dagger$ is

$$(C^\dagger)^\sharp = (C^\sharp)^\dagger = U (M^\sharp)^T V^\dagger = U (M^T)^\sharp V^\dagger.$$

Some important properties of the pseudo-inverse are the following:

1. $CC^\sharp C = C$,

2. $C^\sharp C C^\sharp = C^\sharp$,

3. $(C^\sharp C)^\dagger = C^\sharp C$,

4. $(CC^\sharp)^\dagger = CC^\sharp$.

The pseudo-inverse of an arbitrary $I$ by $J$ matrix $G$ can be used in much
the same way as the inverse of nonsingular matrices to find approximate
or exact solutions of systems of equations $Gx = d$. The following examples
illustrate this point.


Ex. 21.8 If $I > J$ the system $Gx = d$ probably has no exact solution.
Show that whenever $G^\dagger G$ is invertible the pseudo-inverse of $G$ is $G^\sharp =
(G^\dagger G)^{-1} G^\dagger$ so that the vector $x = G^\sharp d$ is the least-squares approximate
solution.


Ex. 21.9 If $I < J$ the system $Gx = d$ probably has infinitely many
solutions. Show that whenever the matrix $GG^\dagger$ is invertible the pseudo-inverse
of $G$ is $G^\sharp = G^\dagger (GG^\dagger)^{-1}$, so that the vector $x = G^\sharp d$ is the exact solution
of $Gx = d$ closest to the origin; that is, it is the minimum-norm solution.
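For the over-determined case, the formula $G^\sharp = (G^T G)^{-1} G^T$ for real $G$ can be tried on a tiny example. The Python sketch below handles only matrices with two independent columns, so that $G^T G$ is 2 by 2; the data and the helper name `pinv_tall` are assumptions for illustration.

```python
from fractions import Fraction

def pinv_tall(G):
    """Return (G^T G)^{-1} G^T for a real I-by-2 matrix G with independent columns."""
    G = [[Fraction(v) for v in row] for row in G]
    a = sum(r[0] * r[0] for r in G)            # entries of the 2 by 2 matrix G^T G
    b = sum(r[0] * r[1] for r in G)
    c = sum(r[1] * r[1] for r in G)
    det = a * c - b * b
    inv = [[c / det, -b / det], [-b / det, a / det]]
    # Row i of the result is row i of (G^T G)^{-1} applied to each row of G.
    return [[inv[i][0] * row[0] + inv[i][1] * row[1] for row in G] for i in range(2)]

G = [[1, 0], [1, 1], [1, 2]]                   # three equations, two unknowns
d = [1, 2, 2]
G_sharp = pinv_tall(G)
x = [sum(g * di for g, di in zip(row, d)) for row in G_sharp]      # x = G^sharp d

residual = [di - sum(g * xi for g, xi in zip(row, x)) for row, di in zip(G, d)]
normal_eq = [sum(G[m][n] * residual[m] for m in range(3)) for n in range(2)]  # G^T r
```

The vector `normal_eq` is zero, which is the usual characterization of a least-squares solution: the residual is orthogonal to the columns of $G$.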


21.8 Singular Values of Sparse Matrices 


In image reconstruction from projections the $M$ by $N$ matrix $A$ is
usually quite large and often $\epsilon$-sparse; that is, most of its elements do not
exceed $\epsilon$ in absolute value, where $\epsilon$ denotes a small positive quantity. In
transmission tomography each column of $A$ corresponds to a single pixel
in the digitized image, while each row of $A$ corresponds to a line segment
through the object, along which an x-ray beam has traveled. The entries
of a given row of $A$ are nonzero only for those columns whose associated
pixel lies on that line segment; clearly, most of the entries of any given row
of $A$ will then be zero. In emission tomography the $I$ by $J$ nonnegative
matrix $P$ has entries $P_{ij} \ge 0$; for each detector $i$ and pixel $j$, $P_{ij}$ is the
probability that an emission at the $j$th pixel will be detected at the $i$th
detector. When a detection is recorded at the $i$th detector, we want the
likely source of the emission to be one of only a small number of pixels. For
single photon emission tomography (SPECT), a lead collimator is used to
permit detection of only those photons approaching the detector straight
on. In positron emission tomography (PET), coincidence detection serves
much the same purpose. In both cases the probabilities $P_{ij}$ will be zero
(or nearly zero) for most combinations of $i$ and $j$. Such matrices are called
sparse (or almost sparse). We discuss now a convenient estimate for the
largest singular value of an almost sparse matrix $A$, which, for notational
convenience only, we take to be real.

In [40] it was shown that if $A$ is normalized so that each row has length
one, then the spectral radius of $A^T A$, which is the square of the largest
singular value of $A$ itself, does not exceed the maximum number of nonzero
elements in any column of $A$. A similar upper bound on $\rho(A^T A)$ can be
obtained for non-normalized, $\epsilon$-sparse $A$.

Let $A$ be an $M$ by $N$ matrix. For each $n = 1, \ldots, N$, let $s_n > 0$ be
the number of nonzero entries in the $n$th column of $A$, and let $s$ be the
maximum of the $s_n$. Let $G$ be the $M$ by $N$ matrix with entries

$$G_{mn} = A_{mn} \Big/ \left(\sum_{l=1}^{N} s_l A_{ml}^2\right)^{1/2}.$$

Lent has shown that the eigenvalues of the matrix $G^T G$ do not exceed
one [107]. This result suggested the following proposition, whose proof was
given in [40].


Proposition 21.1 Let $A$ be an $M$ by $N$ matrix. For each $m = 1, \ldots, M$ let
$\nu_m = \sum_{n=1}^{N} A_{mn}^2 > 0$. For each $n = 1, \ldots, N$ let $\sigma_n = \sum_{m=1}^{M} e_{mn} \nu_m$, where
$e_{mn} = 1$ if $A_{mn} \ne 0$ and $e_{mn} = 0$ otherwise. Let $\sigma$ denote the maximum
of the $\sigma_n$. Then the eigenvalues of the matrix $A^T A$ do not exceed $\sigma$. If $A$
is normalized so that the Euclidean length of each of its rows is one, then
the eigenvalues of $A^T A$ do not exceed $s$, the maximum number of nonzero
elements in any column of $A$.


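Proof: For simplicity, we consider only the normalized case; the proof for
the more general case is similar.

Let $A^T A v = cv$ for some nonzero vector $v$. We show that $c \le s$. We have
$AA^T A v = cAv$ and so $w^T AA^T w = v^T A^T AA^T A v = c v^T A^T A v = c w^T w$,
for $w = Av$. Then, with $e_{mn} = 1$ if $A_{mn} \ne 0$ and $e_{mn} = 0$ otherwise, we
have

$$\left(\sum_{m=1}^{M} A_{mn} w_m\right)^2 = \left(\sum_{m=1}^{M} A_{mn} e_{mn} w_m\right)^2 \le \left(\sum_{m=1}^{M} A_{mn}^2 w_m^2\right)\left(\sum_{m=1}^{M} e_{mn}^2\right)$$

$$= \left(\sum_{m=1}^{M} A_{mn}^2 w_m^2\right) s_n \le \left(\sum_{m=1}^{M} A_{mn}^2 w_m^2\right) s.$$

Therefore,

$$w^T AA^T w = \sum_{n=1}^{N} \left(\sum_{m=1}^{M} A_{mn} w_m\right)^2 \le \sum_{n=1}^{N} \left(\sum_{m=1}^{M} A_{mn}^2 w_m^2\right) s,$$

and

$$w^T AA^T w = c \sum_{m=1}^{M} w_m^2 = c \sum_{m=1}^{M} w_m^2 \left(\sum_{n=1}^{N} A_{mn}^2\right) = c \sum_{m=1}^{M} \sum_{n=1}^{N} w_m^2 A_{mn}^2.$$

The result follows immediately. ∎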


If we normalize $A$ so that its rows have length one, then the trace of the
matrix $AA^T$ is $\mathrm{tr}(AA^T) = M$, which is also the sum of the eigenvalues of
$A^T A$. Consequently, the maximum eigenvalue of $A^T A$ does not exceed $M$;
Proposition 21.1 improves that upper bound considerably, if $A$ is sparse
and so $s \ll M$.
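The bound for row-normalized matrices can be checked numerically on a small example. The Python sketch below estimates the largest eigenvalue of $A^T A$ by power iteration, a technique chosen here for convenience and not taken from the text; the sparsity pattern, starting vector, and iteration count are likewise assumptions.

```python
import math

def largest_eigenvalue_AtA(A, iterations=200):
    """Estimate the largest eigenvalue of A^T A by power iteration."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / (k + 1) for k in range(n_cols)]          # arbitrary starting vector
    lam = 0.0
    for _ in range(iterations):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        AtAv = [sum(A[m][n] * Av[m] for m in range(n_rows)) for n in range(n_cols)]
        lam = sum(a * a for a in Av) / sum(x * x for x in v)   # Rayleigh quotient
        norm = math.sqrt(sum(c * c for c in AtAv))
        v = [c / norm for c in AtAv]
    return lam

# Hypothetical 0/1 sparsity pattern; each column has two nonzero entries.
pattern = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
# For 0/1 rows the Euclidean length is sqrt(number of ones), so this gives unit rows.
A = [[e / math.sqrt(sum(row)) for e in row] for row in pattern]

s = max(sum(1 for row in A if row[n] != 0) for n in range(4))  # max nonzeros per column
M = len(A)                                                     # the cruder trace bound
lam_max = largest_eigenvalue_AtA(A)
```

In this example the largest eigenvalue equals `s`, so the sparsity bound is attained, while the trace bound `M` is twice as large.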

In image reconstruction from projection data that includes scattering we
often encounter matrices $A$ most of whose entries are small, if not exactly
zero. A slight modification of the proof provides us with a useful upper
bound for $L$, the largest eigenvalue of $A^T A$, in such cases. Assume that the
rows of $A$ have length one. For $\epsilon > 0$ let $s$ be the largest number of entries
in any column of $A$ whose magnitudes exceed $\epsilon$. Then we have

$$L \le s + MN\epsilon^2 + 2\epsilon (MNs)^{1/2}.$$

The proof of this result is similar to that for Proposition 21.1.


21.9 Matrix and Vector Differentiation


As we saw previously, the least-squares approximate solution of $Ax = b$
is a vector $\hat{x}$ that minimizes the function $\|Ax - b\|$. In our discussion of
band-limited extrapolation we showed that, for any nonnegative-definite
matrix $Q$, the vector having norm one that maximizes the quadratic form
$x^\dagger Q x$ is an eigenvector of $Q$ associated with the largest eigenvalue. In
the chapter on best linear unbiased optimization we seek a matrix that
minimizes a certain function. All of these examples involve what we can
call matrix-vector differentiation, that is, the differentiation of a function
with respect to a matrix or a vector. The gradient of a function of several
variables is a well-known example and we begin there. Since there is some
possibility of confusion, we adopt the notational convention that boldfaced
symbols, such as $\mathbf{x}$, indicate a column vector, while $x$ denotes a scalar.


21.10 Differentiation with Respect to a Vector 


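Let $\mathbf{x} = (x_1, \ldots, x_N)^T$ be an $N$-dimensional real column vector. Let
$z = f(\mathbf{x})$ be a real-valued function of the entries of $\mathbf{x}$. The derivative of $z$
with respect to $\mathbf{x}$, also called the gradient of $z$, is the column vector

$$\frac{\partial z}{\partial \mathbf{x}} = \mathbf{a} = (a_1, \ldots, a_N)^T$$

with entries

$$a_n = \frac{\partial z}{\partial x_n}.$$


Ex. 21.10 Let $\mathbf{y}$ be a fixed real column vector and $z = f(\mathbf{x}) = \mathbf{y}^T \mathbf{x}$. Show
that

$$\frac{\partial z}{\partial \mathbf{x}} = \mathbf{y}.$$

Ex. 21.11 Let $Q$ be a real symmetric nonnegative-definite matrix, and let
$z = f(\mathbf{x}) = \mathbf{x}^T Q \mathbf{x}$. Show that the gradient of this quadratic form is

$$\frac{\partial z}{\partial \mathbf{x}} = 2Q\mathbf{x}.$$

Hint: Write $Q$ as a linear combination of dyads involving the eigenvectors.

Ex. 21.12 Let $z = \|A\mathbf{x} - \mathbf{b}\|^2$. Show that

$$\frac{\partial z}{\partial \mathbf{x}} = 2A^T A \mathbf{x} - 2A^T \mathbf{b}.$$

Hint: Use $z = (A\mathbf{x} - \mathbf{b})^T (A\mathbf{x} - \mathbf{b})$.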
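The gradient formula $2A^T A \mathbf{x} - 2A^T \mathbf{b}$ for $z = \|A\mathbf{x} - \mathbf{b}\|^2$ can be sanity-checked against central finite differences. The matrix, vectors, and step size `h` in this Python sketch are assumptions chosen for illustration.

```python
def objective(A, b, x):
    """z = ||Ax - b||^2 for real A, b, x."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return sum(ri * ri for ri in r)

def analytic_gradient(A, b, x):
    """2 A^T A x - 2 A^T b, computed as 2 A^T (Ax - b)."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [2.0 * sum(A[m][n] * r[m] for m in range(len(A))) for n in range(len(x))]

def numeric_gradient(A, b, x, h=1e-5):
    """Central-difference approximation of each partial derivative."""
    grad = []
    for n in range(len(x)):
        xp, xm = list(x), list(x)
        xp[n] += h
        xm[n] -= h
        grad.append((objective(A, b, xp) - objective(A, b, xm)) / (2.0 * h))
    return grad

A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
b = [1.0, 0.0, 2.0]
x = [0.5, -1.5]
g_exact = analytic_gradient(A, b, x)
g_approx = numeric_gradient(A, b, x)
```

Because the objective is quadratic, the central difference agrees with the analytic gradient up to rounding error.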


We can also consider the second derivative of $z = f(\mathbf{x})$, which is the
Hessian matrix of $z$

$$H = \frac{\partial^2 z}{\partial \mathbf{x}^2} = \nabla^2 f(\mathbf{x})$$

with entries

$$H_{mn} = \frac{\partial^2 z}{\partial x_m \partial x_n}.$$

If the entries of the vector $\mathbf{z} = (z_1, \ldots, z_M)^T$ are real-valued functions of
the vector $\mathbf{x}$, the derivative of $\mathbf{z}$ is the matrix whose $m$th column is the
derivative of the real-valued function $z_m$. This matrix is usually called the
Jacobian matrix of $\mathbf{z}$. If $M = N$ the determinant of the Jacobian matrix is
the Jacobian.


Ex. 21.13 Suppose $(u, v) = (u(x, y), v(x, y))$ is a change of variables from
the Cartesian $(x, y)$ coordinate system to some other $(u, v)$ coordinate
system. Let $\mathbf{x} = (x, y)^T$ and $\mathbf{z} = (u(\mathbf{x}), v(\mathbf{x}))^T$.

(a) Calculate the Jacobian for the rectangular coordinate system
obtained by rotating the $(x, y)$ system through an angle of $\theta$.

(b) Calculate the Jacobian for the transformation from the $(x, y)$ system
to polar coordinates.


21.11 Differentiation with Respect to a Matrix 


Now we consider real-valued functions $z = f(A)$ of a real matrix $A$. As
an example, for square matrices $A$ we have

$$z = f(A) = \mathrm{trace}(A) = \sum_{n=1}^{N} A_{nn},$$

the sum of the entries along the main diagonal of $A$.

The derivative of $z = f(A)$ is the matrix

$$\frac{\partial z}{\partial A} = B$$

whose entries are

$$B_{mn} = \frac{\partial z}{\partial A_{mn}}.$$

Ex. 21.14 Show that the derivative of $\mathrm{trace}(A)$ is $B = I$, the identity
matrix.


Ex. 21.15 Show that the derivative of $z = \mathrm{trace}(DAC)$ with respect to $A$
is

$$\frac{\partial z}{\partial A} = D^T C^T.$$

Consider the function $f$ defined for all $J$ by $J$ positive-definite
symmetric matrices by

$$f(Q) = -\log \det(Q).$$

Proposition 21.2 The gradient of $f(Q)$ is $g(Q) = -Q^{-1}$.


Proof: Let $\Delta Q$ be symmetric. Let $\gamma_j$, for $j = 1, 2, \ldots, J$, be the eigenvalues
of the symmetric matrix $Q^{-1/2} (\Delta Q) Q^{-1/2}$. These $\gamma_j$ are then real and are
also the eigenvalues of the matrix $Q^{-1} (\Delta Q)$. We shall consider $\|\Delta Q\|$ small,
so we may safely assume that $1 + \gamma_j > 0$.

Note that

$$\langle Q^{-1}, \Delta Q \rangle = \sum_{j=1}^{J} \gamma_j,$$

since the trace of any square matrix is the sum of its eigenvalues. Then we
have

$$f(Q + \Delta Q) - f(Q) = -\log \det(Q + \Delta Q) + \log \det(Q)$$

$$= -\log \det(I + Q^{-1} \Delta Q) = -\sum_{j=1}^{J} \log(1 + \gamma_j).$$

From the submultiplicativity of the Frobenius norm we have

$$\|\gamma\| / \|Q^{-1}\| \le \|\Delta Q\| \le \|\gamma\| \, \|Q\|.$$

Therefore, taking the limit as $\|\Delta Q\|$ goes to zero is equivalent to taking
the limit as $\|\gamma\|$ goes to zero, where $\gamma$ is the vector whose entries are the
$\gamma_j$.

To show that $g(Q) = -Q^{-1}$ note that

$$\limsup_{\|\Delta Q\| \to 0} \frac{|f(Q + \Delta Q) - f(Q) - \langle -Q^{-1}, \Delta Q \rangle|}{\|\Delta Q\|}$$

$$= \limsup_{\|\Delta Q\| \to 0} \frac{|-\log \det(Q + \Delta Q) + \log \det(Q) + \langle Q^{-1}, \Delta Q \rangle|}{\|\Delta Q\|}$$

$$\le \limsup_{\|\gamma\| \to 0} \frac{\sum_{j=1}^{J} |\log(1 + \gamma_j) - \gamma_j|}{\|\gamma\| / \|Q^{-1}\|}$$

$$\le \|Q^{-1}\| \sum_{j=1}^{J} \lim_{\gamma_j \to 0} \frac{|\gamma_j - \log(1 + \gamma_j)|}{|\gamma_j|} = 0.$$


We note in passing that the derivative of $\det(DAC)$ with respect to $A$
is the matrix $\det(DAC) (A^{-1})^T$.


Although the trace is not independent of the order of the matrices in a 
product, it is independent of cyclic permutation of the factors: 


trace (ABC) = trace (CAB) = trace (BCA). 


Therefore, the trace is independent of the order for the product of two 
matrices: 


trace (AB) = trace (BA). 
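From this fact we conclude that

$$\mathbf{x}^\dagger \mathbf{x} = \mathrm{trace}(\mathbf{x}^\dagger \mathbf{x}) = \mathrm{trace}(\mathbf{x} \mathbf{x}^\dagger).$$

If $\mathbf{x}$ is a random vector with correlation matrix

$$R = E(\mathbf{x} \mathbf{x}^\dagger),$$

then

$$E(\mathbf{x}^\dagger \mathbf{x}) = E(\mathrm{trace}(\mathbf{x} \mathbf{x}^\dagger)) = \mathrm{trace}(E(\mathbf{x} \mathbf{x}^\dagger)) = \mathrm{trace}(R).$$


Ex. 21.16 Let $z = \mathrm{trace}(A^T C A)$. Show that the derivative of $z$ with
respect to the matrix $A$ is

$$\frac{\partial z}{\partial A} = CA + C^T A.$$

Therefore, if $C = Q$ is symmetric, then the derivative is $2QA$.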
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We have restricted the discussion here to real matrices and vectors. It 
often happens that we want to optimize a real quantity with respect to a 
complex vector. We can rewrite such quantities in terms of the real and 
imaginary parts of the complex values involved, to reduce everything to 
the real case just considered. For example, let Q be a Hermitian matrix; 
then the quadratic form k' Qk is real, for any complex vector k. As we saw 
in Exercise 21.2, we can write the quadratic form entirely in terms of real 
matrices and vectors. 

If $w = u + iv$ is a complex number with real part $u$ and imaginary part
$v$, the function $z = f(w) = |w|^2$ is real-valued. The derivative of
$z = f(w)$ with respect to the complex variable $w$ does not exist. When we
write $z = u^2 + v^2$, we consider $z$ as a function of the real vector
$x = (u, v)^T$. The derivative of $z$ with respect to $x$ is the vector
$(2u, 2v)^T$.

Similarly, when we consider the real quadratic form $k^\dagger Q k$, we view
each of the complex entries of the N by 1 vector $k$ as two real numbers
forming a two-dimensional real vector. We then differentiate the quadratic
form with respect to the 2N by 1 real vector formed from these real and
imaginary parts. If we turn the resulting 2N by 1 real vector back into an
N by 1 complex vector, we get $2Qk$ as the derivative; so, it appears as if
the formula for differentiating in the real case carries over to the complex
case.
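The claim that the recombined real gradient equals $2Qk$ can be checked with finite differences. This is only a sketch: the 2 by 2 Hermitian matrix, the vector, and the helper names `quad_form` and `partial` are made up for illustration.

```python
def quad_form(Q, k):
    """Real value of k^dagger Q k for Hermitian Q and complex vector k."""
    n = len(k)
    total = sum(k[i].conjugate() * Q[i][j] * k[j]
                for i in range(n) for j in range(n))
    return total.real

def partial(Q, k, n, step):
    """Central difference of the quadratic form along `step` in entry n."""
    forward = list(k)
    forward[n] += step
    backward = list(k)
    backward[n] -= step
    return (quad_form(Q, forward) - quad_form(Q, backward)) / (2 * abs(step))

Q = [[2.0, 1 - 1j], [1 + 1j, 3.0]]      # Hermitian: Q equals its conjugate transpose
k = [0.5 + 1.0j, -1.5 + 0.25j]
h = 1e-6

# Derivative with respect to the real part becomes the real part of the
# complex gradient entry; derivative with respect to the imaginary part
# becomes its imaginary part.
gradient = [complex(partial(Q, k, n, h), partial(Q, k, n, 1j * h))
            for n in range(len(k))]

two_Qk = [2 * sum(Q[i][j] * k[j] for j in range(len(k)))
          for i in range(len(k))]

print(gradient)
print(two_Qk)
```

The two printed vectors should agree up to finite-difference error.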


21.12 Eigenvectors and Optimization 


We can use these results concerning differentiation with respect to a 
vector to show that eigenvectors solve certain optimization problems. 
Consider the problem of maximizing the quadratic form $x^\dagger Q x$, subject
to $x^\dagger x = 1$; here the matrix $Q$ is Hermitian, positive-definite, so
that all of its eigenvalues are positive. We use the Lagrange-multiplier
approach, with the Lagrangian
$$L(x, \lambda) = x^\dagger Q x - \lambda x^\dagger x,$$


where the scalar variable $\lambda$ is the Lagrange multiplier. We
differentiate $L(x, \lambda)$ with respect to $x$ and set the result equal to
zero, obtaining


$$2Qx - 2\lambda x = 0,$$


or


$$Qx = \lambda x.$$


Therefore, $x$ is an eigenvector of $Q$ and $\lambda$ is its eigenvalue. Since


$$x^\dagger Q x = \lambda x^\dagger x = \lambda,$$
we conclude that $\lambda = \lambda_1$, the largest eigenvalue of $Q$, and
$x = u^1$, a norm-one eigenvector associated with $\lambda_1$.

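Now consider the problem of maximizing $x^\dagger Q x$, subject to
$x^\dagger x = 1$, and $x^\dagger u^1 = 0$. The Lagrangian is now


$$L(x, \lambda, \alpha) = x^\dagger Q x - \lambda x^\dagger x - \alpha x^\dagger u^1.$$


Differentiating with respect to the vector $x$ and setting the result equal to
zero, we find that
$$2Qx - 2\lambda x - \alpha u^1 = 0,$$


or
$$Qx = \lambda x + \beta u^1,$$


for $\beta = \alpha/2$. But, we know that
$$(u^1)^\dagger Q x = \lambda (u^1)^\dagger x + \beta (u^1)^\dagger u^1 = \beta,$$


and
$$(u^1)^\dagger Q x = (Q u^1)^\dagger x = \lambda_1 (u^1)^\dagger x = 0,$$


so $\beta = 0$ and we have
$$Qx = \lambda x.$$


Since
$$x^\dagger Q x = \lambda,$$


we conclude that $x$ is a norm-one eigenvector of $Q$ associated with the
second-largest eigenvalue, $\lambda = \lambda_2$.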

Continuing in this fashion, we can show that the norm-one eigenvector
of $Q$ associated with the $n$th largest eigenvalue $\lambda_n$ maximizes the
quadratic form $x^\dagger Q x$, subject to the constraints $x^\dagger x = 1$
and $x^\dagger u^m = 0$, for $m = 1, 2, \ldots, n-1$.
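The successive constrained maximizations can be imitated numerically. The sketch below does not use Lagrange multipliers; it swaps in power iteration with projection onto the orthogonal complement of the eigenvectors already found, which is one simple way to reach the same maximizers for a real symmetric positive-definite matrix. The matrix and the function names are illustrative assumptions.

```python
import math

def matvec(Q, x):
    return [sum(q * xj for q, xj in zip(row, x)) for row in Q]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def top_eigenvector(Q, orthogonal_to=(), iterations=500):
    """Power iteration restricted to vectors orthogonal to the given unit vectors."""
    x = [1.0 / (i + 1) for i in range(len(Q))]   # arbitrary starting vector
    for _ in range(iterations):
        x = matvec(Q, x)
        for u in orthogonal_to:                  # enforce the constraint x^T u = 0
            c = dot(x, u)
            x = [a - c * b for a, b in zip(x, u)]
        x = normalize(x)                         # enforce the constraint x^T x = 1
    return x

# Symmetric positive-definite test matrix with eigenvalues 5, 3 and 1.
Q = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 5.0]]

u1 = top_eigenvector(Q)
u2 = top_eigenvector(Q, orthogonal_to=[u1])
u3 = top_eigenvector(Q, orthogonal_to=[u1, u2])

# The maximum value of x^T Q x under each set of constraints.
lam1, lam2, lam3 = (dot(u, matvec(Q, u)) for u in (u1, u2, u3))
print(lam1, lam2, lam3)
```

Each constrained maximum value is the next eigenvalue in decreasing order, as the argument above predicts.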


Chapter 22 


Compressed Sensing 


22.1 Chapter Summary
22.2 An Overview
22.3 Compressed Sensing
22.4 Sparse Solutions
    22.4.1 Maximally Sparse Solutions
    22.4.2 Minimum One-Norm Solutions
    22.4.3 Minimum One-Norm as an LP Problem
    22.4.4 Why the One-Norm?
    22.4.5 Comparison with the PDFT
    22.4.6 Iterative Reweighting
22.5 Why Sparseness?
    22.5.1 Signal Analysis
    22.5.2 Locally Constant Signals
    22.5.3 Tomographic Imaging
22.6 Compressed Sampling


22.1 Chapter Summary 


Large amounts of data are often redundant and methods for compress- 
ing these data sets play an increasingly important role in a number of 
applications. The basic idea is to find ways to expand the data vector as 
a superposition of known vectors, so that only a few of the coefficients are 
nonzero. Much of the research in this field goes under the names compressed 
sensing and compressed sampling (CS) [67]. The key notion in CS is sparse- 
ness. The JPEG technology uses such an approach to represent images as 
a superposition of sinusoids and wavelets. For applications such as medical 
imaging, CS provides a means of reducing radiation dosage to the patient 
without sacrificing image quality. An important aspect of CS is finding 
sparse solutions of underdetermined systems of linear equations, which can 
often be accomplished by one-norm minimization. The best reference on 
CS to date is probably [16]. 


22.2 An Overview


In this section we “compress” Justin Romberg’s article [133] that cap- 
tures well the essence of compressed sensing and compressed sampling. 

In classical data compression, the data vector is first transformed into 
a superposition of known basis “signals.” If this basis is well chosen, then 
most of the information in the data will be concentrated in a few terms 
with relatively large coefficients; the representation of the data is then 
said to be sparse with respect to the chosen basis. The data compression is 
then achieved by discarding the terms with relatively small coefficients. For 
example, vectorized digital photographs often have a sparse representation 
with respect to a wavelet basis that measures intensity at different scales. 
In this traditional approach, a great deal of data is obtained by sampling 
at a very high rate, and then applying a transform and selection process 
to produce a much smaller vector of important coefficients. Gathering a
large amount of data just to produce a much smaller vector of coefficients
seems wasteful. Compressed sensing attempts to avoid this
wastefulness by integrating the compression step into the sampling process 
itself. 

Normally, sampling means recording the values of an analog signal at 
some discrete set of points. Instead, CS devices provide initial data
consisting of “correlations,” that is, matched-filter values, between the
signal and
a set of known test signals. The big question is: How do we select this set 
of test signals? One might think that it is best to use as the test signals the 
members of the basis with respect to which the data is sparsely represented. 
Certainly, a small number of correlations would suffice to capture the sig- 
nal, since the representation is known to be sparse. But we don’t know 
which basis members are the important ones, and we would have to obtain 
a large number of correlations to find out which ones are the important 
ones, defeating the purpose of reducing the sampling effort. The solution, 
surprisingly, is to select the test signals at random, making the test signals 
quite unlike the basis vectors that produced the sparse representation. 

At the final step, an algorithm is then applied to extract the desired 
information from this smaller set of correlations. Now we seek a solution 
that is consistent with the sampled data and also sparse, with respect to the 
original basis. With b the vector of correlations, x a vector of coefficients 
in the sparse-representation basis, and A the matrix describing the linear 
transformation, we seek a maximally sparse solution of Ax = b. Finding 
such a maximally sparse solution is not easy; it is an NP-hard problem. It 
has been discovered that finding the minimum one-norm solution is often a 
reasonable substitute, which means that the computation can be converted 
to a linear-programming problem. 
22.3 Compressed Sensing


The objective in CS is to exploit sparseness to reconstruct a vector $f$
in $\mathbb{R}^J$ from relatively few linear functional measurements [67].

Let $U = \{u^1, u^2, \ldots, u^J\}$ and $V = \{v^1, v^2, \ldots, v^J\}$ be two
orthonormal bases for $\mathbb{R}^J$, with all members of $\mathbb{R}^J$
represented as column vectors. For $i = 1, 2, \ldots, J$, let


$$\mu_i = \max_{1 \le j \le J} \{ |\langle u^i, v^j \rangle| \}$$


and


$$\mu(U, V) = \max_{1 \le i \le J} \mu_i.$$


We know from Cauchy’s Inequality that
$$|\langle u^i, v^j \rangle| \le 1,$$


and from Parseval’s Equation


$$\sum_{j=1}^{J} |\langle u^i, v^j \rangle|^2 = \|u^i\|^2 = 1.$$


Therefore, we have


$$\frac{1}{\sqrt{J}} \le \mu(U, V) \le 1.$$
The quantity $\mu(U, V)$ is the coherence measure of the two bases; the closer
$\mu(U, V)$ is to the lower bound of $1/\sqrt{J}$, the more incoherent the two
bases are. We give an example of incoherent bases for $\mathbb{C}^J$.

Let $U = \{u^1, u^2, \ldots, u^J\}$ be the usual orthonormal basis for
$\mathbb{C}^J$, where all the entries of $u^j$ are zero, except that
$u^j_j = 1$. Let $V = \{v^1, v^2, \ldots, v^J\}$ be the Fourier basis, with
the $k$th entry of $v^j$ given by


$$v^j_k = \frac{1}{\sqrt{J}} e^{2\pi i k j / J}.$$

Then it is easy to show that $\mu(U, V) = 1/\sqrt{J}$. Clearly, each vector
$u^j$ has a maximally sparse representation in the U basis, but not in the V
basis. Similarly, each $v^j$ has a maximally sparse representation in the V
basis, but not in the U basis. When $J$ is large, we may well want to estimate
the index $j$ from the measurement of relatively few coefficients of $u^j$ in
the V-basis representation. This is compressed sampling.
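The coherence of the standard and Fourier bases can be computed directly from the definition. The following sketch uses only the standard library; the dimension J = 8 and the helper names are arbitrary choices for illustration.

```python
import cmath
import math

def inner(u, v):
    """Complex inner product, conjugating the second argument."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def coherence(U, V):
    """mu(U, V): the largest |<u^i, v^j>| over all pairs of basis vectors."""
    return max(abs(inner(u, v)) for u in U for v in V)

J = 8
# Standard basis: u^j has a single 1 in position j.
standard = [[1.0 + 0j if k == j else 0j for k in range(J)] for j in range(J)]
# Fourier basis: k-th entry of v^j is exp(2 pi i k j / J) / sqrt(J).
fourier = [[cmath.exp(2j * cmath.pi * k * j / J) / math.sqrt(J)
            for k in range(J)] for j in range(J)]

mu = coherence(standard, fourier)
print(mu, 1 / math.sqrt(J))
```

Every entry of every Fourier basis vector has modulus $1/\sqrt{J}$, so the coherence attains the lower bound; comparing a basis with itself gives the upper bound of 1.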

Let $f$ be a fixed member of $\mathbb{R}^J$; we expand $f$ in the V basis as


$$f = x_1 v^1 + x_2 v^2 + \cdots + x_J v^J.$$
We say that the coefficient vector $x = (x_1, \ldots, x_J)$ is s-sparse if $s$
is the number of nonzero $x_j$.

If $s$ is small, most of the $x_j$ are zero, but since we do not know which
ones these are, we would have to compute all the linear functional values


$$x_j = \langle f, v^j \rangle$$


to recover $f$ exactly. In fact, the smaller $s$ is, the harder it would be to
learn anything from randomly selected $x_j$, since most would be zero. The
idea in CS is to obtain measurements of $f$ with members of a different
orthonormal basis, which we call the U basis. If the members of U are very
much like the members of V, then nothing is gained. But, if the members of U
are quite unlike the members of V, then each inner product measurement


$$y_i = \langle f, u^i \rangle = f^T u^i$$


should tell us something about $f$. If the two bases are sufficiently
incoherent, then relatively few $y_i$ values should tell us quite a bit about
$f$. Specifically, we have the following result due to Candès and Romberg
[46]: suppose the coefficient vector $x$ for representing $f$ in the V basis
is s-sparse. Select uniformly randomly $M < J$ members of the U basis and
compute the measurements $y_i = \langle f, u^i \rangle$. Then, if $M$ is
sufficiently large, it is highly probable that $z = x$ also solves the problem
of minimizing the one-norm


$$\|z\|_1 = |z_1| + |z_2| + \cdots + |z_J|,$$
subject to the conditions
$$y_i = \langle g, u^i \rangle = g^T u^i,$$
for those $M$ randomly selected $u^i$, where


$$g = z_1 v^1 + z_2 v^2 + \cdots + z_J v^J.$$
The smaller $\mu(U, V)$ is, the smaller the $M$ is permitted to be without
reducing the probability of perfect reconstruction.


22.4 Sparse Solutions 


Suppose that $A$ is a real I by J matrix, with $I < J$, and that the linear
system $Ax = b$ has infinitely many solutions. For any vector $x$, we define
the support of $x$ to be the subset $S$ of $\{1, 2, \ldots, J\}$ consisting of
those $j$ for which the entries $x_j \neq 0$. For any under-determined system
$Ax = b$, there will, of course, be at least one solution of minimum support,
that is, for which $|S|$, the size of the support set $S$, is minimum.
However, finding such a maximally sparse solution requires combinatorial
optimization, and is known to be computationally difficult. It is important,
therefore, to have a computationally tractable method for finding maximally
sparse solutions. The discussion in this section is based on [16].


22.4.1 Maximally Sparse Solutions 


Consider the problem $P_0$: among all solutions $x$ of the consistent system
$b = Ax$, find one, call it $\hat{x}$, that is maximally sparse, that is, has
the minimum number of nonzero entries. Obviously, there will be at least one
such solution having minimal support, but finding one is a combinatorial
optimization problem and is generally NP-hard. For notational convenience, we
denote by $\|x\|_0$ the number of nonzero entries of $x$.

There are two basic questions concerning the problem $P_0$:


1. Can uniqueness of the solution be claimed? Under what conditions?


2. If a candidate for the solution is available, is there a simple test to
determine if it is, in fact, a solution?


Definition 22.1 Let $A$ be an I by J matrix, with $I < J$. The spark of $A$
is the smallest number of linearly dependent columns.


We denote the spark of $A$ by $\mathrm{sp}(A)$. The definition of the spark of
$A$ is superficially similar to that of the rank of $A$, but the spark is a
more difficult quantity to calculate. Notice that, if we change the word
“columns” to “rows” in the definition, we may get a different number. For
example, the 5 by 6 matrix


$$A = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$


has a rank of 3 and a spark of 2, although the smallest number of linearly
dependent rows is 1. The rank of the matrix


$$B = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}$$


is 5, and the spark is 6, while the rank of the matrix


$$C = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}$$


is 5 and its spark is 2. The spark is an important notion when seeking
sparse solutions of $Ax = b$, where we assume that $I < J$. The spark is not
defined for matrices with $I > J$.
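For small matrices the spark can be found by brute force, which also makes plain why it is expensive in general: every subset of columns may have to be examined. The sketch below is an illustration under that assumption, not a practical algorithm; the function names are hypothetical, and the matrix `B` is the identity followed by a column of ones, as read from the example above.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank by exact Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def spark(A):
    """Smallest number of linearly dependent columns (None if there are none)."""
    n_cols = len(A[0])
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            sub = [[row[j] for j in cols] for row in A]
            if rank(sub) < size:        # these columns are linearly dependent
                return size
    return None

A = [[1, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]]
B = [[1 if i == j else 0 for j in range(5)] + [1] for i in range(5)]
C = [[1 if i == j else 0 for j in range(5)] + [1 if i == 0 else 0]
     for i in range(5)]

for name, M in (("A", A), ("B", B), ("C", C)):
    print(name, "rank", rank(M), "spark", spark(M))
```

The search visits column subsets in order of increasing size, so the first dependent subset found has the minimum size.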

The following theorem is not difficult to prove. 


Theorem 22.1 If $Ax = b$ and $\|x\|_0 < \mathrm{sp}(A)/2$, then $x$ solves
$P_0$.


Unfortunately, calculating the spark of a matrix is typically more difficult
than solving $P_0$. There is a simpler way, fortunately. We denote by $a_k$
the $k$th column of the matrix $A$.


Definition 22.2 The mutual coherence of the matrix $A$ is


$$\mu(A) = \max_{i \neq j} \frac{|a_i^\dagger a_j|}{\|a_i\|_2 \, \|a_j\|_2}.$$


The matrix $A$ is said to have nearly incoherent columns if $\mu(A)$ is nearly
equal to zero. If $A$ were square and orthogonal, then we would have
$\mu(A) = 0$. However, we are assuming that $A$ is I by J, with $I < J$, so
that $\mu(A) > 0$. The following lemma is helpful.


Lemma 22.1 For any matrix $A$ we have

$$\mathrm{sp}(A) \ge 1 + \frac{1}{\mu(A)}.$$

As a consequence, we get the following theorem.


Theorem 22.2 If $Ax = b$ and


$$\|x\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(A)}\right),$$


then $x$ solves $P_0$.
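Unlike the spark, the mutual coherence needs only the pairwise inner products of the columns. The following sketch computes it for a real matrix and evaluates the sparsity bound of Theorem 22.2; the function names and the test matrix (a 5 by 5 identity with an appended column of ones) are illustrative choices.

```python
import math

def mutual_coherence(A):
    """Largest normalized inner product between two distinct columns of A."""
    cols = list(zip(*A))
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    return max(
        abs(sum(a * b for a, b in zip(cols[i], cols[j]))) / (norms[i] * norms[j])
        for i in range(len(cols)) for j in range(i + 1, len(cols))
    )

def sparsity_bound(A):
    """Right-hand side of Theorem 22.2: (1 + 1/mu(A)) / 2."""
    return 0.5 * (1.0 + 1.0 / mutual_coherence(A))

# 5 by 6 test matrix: identity columns followed by a column of ones.
B = [[1.0 if i == j else 0.0 for j in range(5)] + [1.0] for i in range(5)]

mu = mutual_coherence(B)
print(mu, sparsity_bound(B))
```

For this matrix the only nonzero inner products are between an identity column and the column of ones, giving $\mu = 1/\sqrt{5}$ and a bound of about 1.62, so the theorem certifies only solutions with a single nonzero entry. The cost is quadratic in the number of columns, in contrast to the combinatorial search needed for the spark.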




22.4.2 Minimum One-Norm Solutions 


A more tractable problem is to seek a minimum one-norm solution,
that is, we can solve the problem $P_1$: minimize


$$\|x\|_1 = \sum_{j=1}^{J} |x_j|,$$


subject to $Ax = b$. Let $x^*$ be a solution of $P_1$. Problem $P_1$ can be
formulated as a linear programming problem, so is more easily solved. The big
questions are: when does $P_1$ have a unique solution $x^*$, and when is
$x^* = \hat{x}$? The problem $P_1$ will have a unique solution if and only if
$A$ is such that the one-norm satisfies
$$\|x^*\|_1 < \|x^* + v\|_1,$$

for all nonzero $v$ in the null space of $A$. We have the following theorem.

Theorem 22.3 If $A$ is I by J, with full rank and $I < J$, and $Ax = b$, with

$$\|x\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(A)}\right),$$


then $x$ solves both $P_0$ and $P_1$.


22.4.3 Minimum One-Norm as an LP Problem 


The entries of $x$ need not be nonnegative, so the problem is not yet a
linear programming problem. Let


$$B = \begin{bmatrix} A & -A \end{bmatrix},$$


and consider the linear programming problem of minimizing the function

$$c^T z = \sum_{j=1}^{2J} z_j,$$

subject to the constraints $z \ge 0$, and $Bz = b$. Let $z^*$ be the solution.
We write

$$z^* = \begin{bmatrix} u^* \\ v^* \end{bmatrix}.$$

Then, as we shall see, $x^* = u^* - v^*$ minimizes the one-norm, subject to
$Ax = b$.

First, we show that $u_j^* v_j^* = 0$, for each $j$. If, say, there is a $j$
such that $0 < v_j^* < u_j^*$, then we can create a new vector $z$ by
replacing the old $u_j^*$ with $u_j^* - v_j^*$ and the old $v_j^*$ with zero,
while maintaining $Bz = b$. But then, since
$u_j^* - v_j^* < u_j^* + v_j^*$, it follows that $c^T z < c^T z^*$, which is
a contradiction. Consequently, we have $\|x^*\|_1 = c^T z^*$.

Now we select any $x$ with $Ax = b$. Write $u_j = x_j$, if $x_j > 0$, and
$u_j = 0$, otherwise. Let $v_j = u_j - x_j$, so that $x = u - v$. Then let


$$z = \begin{bmatrix} u \\ v \end{bmatrix}.$$
Then $b = Ax = Bz$, and $c^T z = \|x\|_1$. Consequently,


$$\|x^*\|_1 = c^T z^* \le c^T z = \|x\|_1,$$


and $x^*$ must be a minimum one-norm solution.
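The splitting used in this argument can be checked on a small example. The sketch below does not solve the linear program; it only verifies that, for a given $x$, the vector $z$ built from the positive and negative parts of $x$ is feasible for the LP and has cost equal to the one-norm of $x$. The matrix, the vector, and the helper names are made up for illustration.

```python
def split(x):
    """Write x = u - v with u, v >= 0 and u_j * v_j = 0, and stack z = [u; v]."""
    u = [xj if xj > 0 else 0.0 for xj in x]
    v = [uj - xj for uj, xj in zip(u, x)]
    return u + v

def matvec(M, z):
    return [sum(m * zj for m, zj in zip(row, z)) for row in M]

A = [[1.0, 2.0, -1.0],
     [0.0, 1.0, 3.0]]
B = [row + [-a for a in row] for row in A]    # B = [A  -A]

x = [1.5, -2.0, 0.0]
z = split(x)

Ax = matvec(A, x)
Bz = matvec(B, z)
cost = sum(z)                                  # c^T z with c the vector of ones
one_norm = sum(abs(xj) for xj in x)

print(z, Ax, Bz, cost, one_norm)
```

Because every feasible $x$ gives a feasible $z$ with the same cost, and the LP optimum has $u_j^* v_j^* = 0$, minimizing $c^T z$ is the same as minimizing the one-norm.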


22.4.4 Why the One-Norm? 


When a system of linear equations $Ax = b$ is under-determined, we
can find the minimum-two-norm solution that minimizes the square of the
two-norm,

$$\|x\|_2^2 = \sum_{j=1}^{J} x_j^2,$$


subject to $Ax = b$. One drawback to this approach is that the two-norm
penalizes relatively large values of $x_j$ much more than the smaller ones,
so tends to provide non-sparse solutions. Alternatively, we may seek the
solution for which the one-norm,


$$\|x\|_1 = \sum_{j=1}^{J} |x_j|,$$


is minimized. The one-norm still penalizes relatively large entries $x_j$ more
than the smaller ones, but much less than the two-norm does. As a result,
it often happens that the minimum one-norm solution actually solves $P_0$
as well.
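The contrast between the two norms is already visible for a single equation $a^T x = b$, where both minimizers have closed forms: the minimum two-norm solution is $x = a\,b/(a^T a)$, and, since $|b| = |a^T x| \le \max_j |a_j| \, \|x\|_1$, a minimum one-norm solution places all of its weight on an entry where $|a_j|$ is largest. The following sketch uses these facts; the numbers and function names are arbitrary.

```python
def min_two_norm(a, b):
    """Minimum two-norm solution of the single equation a . x = b."""
    scale = b / sum(aj * aj for aj in a)
    return [scale * aj for aj in a]

def min_one_norm(a, b):
    """A minimum one-norm solution: use only the entry with the largest |a_j|."""
    m = max(range(len(a)), key=lambda j: abs(a[j]))
    x = [0.0] * len(a)
    x[m] = b / a[m]
    return x

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

a, b = [1.0, 2.0, 1.5], 3.0
x_two = min_two_norm(a, b)
x_one = min_one_norm(a, b)

nonzeros_two = sum(1 for v in x_two if v != 0)
nonzeros_one = sum(1 for v in x_one if v != 0)
print(x_two, nonzeros_two)
print(x_one, nonzeros_one)
```

Here the minimum two-norm solution spreads the signal over all three entries, while the minimum one-norm solution has a single nonzero entry and is also maximally sparse.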


22.4.5 Comparison with the PDFT 


The PDFT approach to solving the under-determined system $Ax = b$
is to select weights $w_j \ge 0$ and then to find the solution $\tilde{x}$
that minimizes the weighted two-norm given by


$$\sum_{j \in S} |x_j|^2 w_j,$$
where $S$ is the support set of $w$, meaning that $S$ is the set of all $j$
for which $w_j > 0$. Our intention is to select weights $w_j$ so that
$w_j^{-1}$ is reasonably close to $|x_j^*|$. Consider, therefore, what happens
when $S$ is the support set of $x^*$ and $w_j^{-1} = |x_j^*|$ for $j \in S$.
We show that $\tilde{x}$ is also a minimum-one-norm solution.

To see why this is true, note that, for any $x$ supported on $S$, we have


$$\|x\|_1 = \sum_{j \in S} |x_j| = \sum_{j \in S} \frac{|x_j|}{\sqrt{|x_j^*|}} \sqrt{|x_j^*|}
\le \sqrt{\sum_{j \in S} \frac{|x_j|^2}{|x_j^*|}} \, \sqrt{\sum_{j \in S} |x_j^*|}.$$


Therefore,


$$\|\tilde{x}\|_1 \le \sqrt{\sum_{j \in S} \frac{|\tilde{x}_j|^2}{|x_j^*|}} \, \sqrt{\sum_{j \in S} |x_j^*|}
\le \sqrt{\sum_{j \in S} \frac{|x_j^*|^2}{|x_j^*|}} \, \sqrt{\sum_{j \in S} |x_j^*|}
= \sum_{j \in S} |x_j^*| = \|x^*\|_1.$$


Therefore, $\tilde{x}$ also minimizes the one-norm.


22.4.6 Iterative Reweighting 


Let $x$ be the truth. Generally, we want each weight $w_j$ to be a good
prior estimate of the reciprocal of $|x_j|$. Because we do not yet know $x$,
we may take a sequential-optimization approach, beginning with weights
$w_j > 0$, finding the PDFT solution using these weights, then using this
PDFT solution to get a (we hope!) better choice for the weights, and so on.
This sequential approach was successfully implemented in the early 1980’s
by Michael Fiddy and his students [74].

In [47], the same approach is taken, but with respect to the one-norm.
Since the one-norm still penalizes larger values disproportionately, balance
can be achieved by minimizing a weighted-one-norm, with weights close to
the reciprocals of the $|x_j|$. Again, not yet knowing $x$, they employ a
sequential approach, using the previous minimum-weighted-one-norm solution to
obtain the new set of weights for the next minimization. At each step of
the sequential procedure, the previous reconstruction is used to estimate
the true support of the desired solution.

It is interesting to note that an on-going debate among users of the
PDFT concerns the nature of the prior weighting. Does $w_j$ approximate
$|x_j|^{-1}$ or $|x_j|^{-2}$? This is close to the issue treated in [47], the
use of a weight in the minimum-one-norm approach.

It should be noted again that finding a sparse solution is not usually
the goal in the use of the PDFT, but the use of the weights has much the
same effect as using the one-norm to find sparse solutions. To the extent
that the weights approximate the entries of $\hat{x}$, their use reduces the
penalty associated with the larger entries of an estimated solution.
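The sequential reweighting idea can be sketched for the simplest case of a single equation $a^T x = b$, where the weighted two-norm minimizer has the closed form $x_j = (a_j / w_j)\, b / \sum_k (a_k^2 / w_k)$. The update rule $w_j = 1/(|x_j| + \epsilon)$, the value of $\epsilon$, and the iteration count are assumptions made for this illustration; they are not taken from [74] or [47].

```python
def weighted_min_norm(a, b, w):
    """Minimize sum_j w_j * x_j**2 subject to the single equation a . x = b."""
    denom = sum(aj * aj / wj for aj, wj in zip(a, w))
    return [(aj / wj) * b / denom for aj, wj in zip(a, w)]

def reweighted_solution(a, b, iterations=100, eps=1e-12):
    """Repeat the weighted solve, taking each new weight as 1 / (|x_j| + eps)."""
    w = [1.0] * len(a)                 # start from the ordinary two-norm
    x = weighted_min_norm(a, b, w)
    for _ in range(iterations):
        w = [1.0 / (abs(xj) + eps) for xj in x]
        x = weighted_min_norm(a, b, w)
    return x

a, b = [1.0, 2.0, 1.5], 3.0
x_first = weighted_min_norm(a, b, [1.0] * len(a))   # non-sparse starting point
x_final = reweighted_solution(a, b)
print(x_first)
print(x_final)
```

In this example the iterates concentrate on the entry with the largest $|a_j|$, so the limit has a single significant entry and agrees with the minimum one-norm solution of the same equation.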


22.5 Why Sparseness? 


One obvious reason for wanting sparse solutions of Ax = b is that we 
have prior knowledge that the desired solution is sparse. Such a problem 
arises in signal analysis from Fourier-transform data. In other cases, such 
as in the reconstruction of locally constant signals, it is not the signal itself, 
but its discrete derivative, that is sparse. 


22.5.1 Signal Analysis 


Suppose that our signal $f(t)$ is known to consist of a small number of
complex exponentials, so that $f(t)$ has the form


$$f(t) = \sum_{j=1}^{J} a_j e^{i \omega_j t},$$


for some small number of frequencies $\omega_j$ in the interval $[0, 2\pi)$.
For $n = 0, 1, \ldots, N-1$, let $f_n = f(n)$, and let $f$ be the vector in
$\mathbb{C}^N$ with entries $f_n$; we assume that $J$ is much smaller than
$N$. The discrete (vector) Fourier transform of $f$ is the vector $F$ having
the entries


$$F_k = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} f_n e^{2\pi i k n / N},$$


for $k = 0, 1, \ldots, N-1$; we write $F = Ef$, where $E$ is the N by N matrix
with entries $E_{kn} = \frac{1}{\sqrt{N}} e^{2\pi i k n / N}$. If $N$ is large
enough, we may safely assume that each of the $\omega_j$ is equal to one of
the frequencies $2\pi k / N$ and that the vector $F$ is J-sparse. The question
now is: How many values of $f(n)$ do we need to calculate in order to be sure
that we can recapture $f(t)$ exactly?


We have the following theorem [45]: 


Theorem 22.4 Let $N$ be prime. Let $S$ be any subset of $\{0, 1, \ldots, N-1\}$
with $|S| \ge 2J$. Then the vector $F$ can be uniquely determined from the
measurements $f_n$ for $n$ in $S$.

We know that
$$f = E^\dagger F,$$


where $E^\dagger$ is the conjugate transpose of the matrix $E$. The point here
is that, for any matrix $R$ obtained from the identity matrix $I$ by deleting
$N - |S|$ rows, we can recover the vector $F$ from the measurements $Rf$.

If $N$ is not prime, then the assertion of the theorem may not hold,
since we can have $n = 0 \bmod N$, without $n = 0$. However, the assertion
remains valid for most sets of $J$ frequencies and most subsets $S$ of
indices; therefore, with high probability, we can recover the vector $F$ from
$Rf$. Note the similarity between this and Prony’s method.

Note that the matrix $E$ is unitary, that is, $E^\dagger E = I$, and,
equivalently, the columns of $E$ form an orthonormal basis for
$\mathbb{C}^N$. The data vector is


$$b = Rf = RE^\dagger F.$$


In this example, the vector $f$ is not sparse, but can be represented sparsely
in a particular orthonormal basis, namely as $f = E^\dagger F$, using a sparse
vector $F$ of coefficients. The representing basis then consists of the
columns of the matrix $E^\dagger$. The measurements pertaining to the vector
$f$ are the values $f_n$, for $n$ in $S$. Since $f_n$ can be viewed as the
inner product of $f$ with $\delta^n$, the $n$th column of the identity matrix
$I$, that is,


$$f_n = \langle \delta^n, f \rangle,$$


the columns of $I$ provide the so-called sampling basis. With
$A = RE^\dagger$ and $x = F$, we then have
$$Ax = b,$$


with the vector $x$ sparse. It is important for what follows to note that the
matrix $A$ is random, in the sense that we choose which rows of $I$ to use to
form $R$.


22.5.2 Locally Constant Signals 


Suppose now that the function $f(t)$ is locally constant, its graph
consisting of some number of horizontal lines. We discretize the function
$f(t)$ to get the vector $f = (f(0), f(1), \ldots, f(N-1))^T$. The discrete
derivative vector is $g = (g_1, g_2, \ldots, g_{N-1})^T$, with


$$g_n = f(n) - f(n-1).$$


Since f(t) is locally constant, the vector g is sparse. The data we will have 
will not typically be values f(n). The goal will be to recover f from M 
linear functional values pertaining to f, where M is much smaller than N. 
We shall assume, from now on, that we have measured, or can estimate, 
the value f(0). 
Our M by 1 data vector $d$ consists of measurements pertaining to the
vector $f$:

$$d_m = \sum_{n=0}^{N-1} H_{mn} f_n,$$


for $m = 1, \ldots, M$, where the $H_{mn}$ are known. We can then write

$$d_m = f(0) \left( \sum_{n=0}^{N-1} H_{mn} \right) + \sum_{k=1}^{N-1} \left( \sum_{n=k}^{N-1} H_{mn} \right) g_k.$$


Since $f(0)$ is known, we can write


$$b_m = d_m - f(0) \left( \sum_{n=0}^{N-1} H_{mn} \right) = \sum_{k=1}^{N-1} A_{mk} g_k,$$


where

$$A_{mk} = \sum_{n=k}^{N-1} H_{mn}.$$


The problem is then to find a sparse solution of $Ax = b$, namely $x = g$. As
in the previous example, we often have the freedom to select the linear
functions, that is, the values $H_{mn}$, so the matrix $A$ can be viewed as
random.
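The change of variables from $f$ to its discrete derivative $g$ can be checked numerically. In the sketch below the signal, the sizes $N$ and $M$, and the random measurement values $H_{mn}$ are all made-up test data; the code only confirms that $b_m = \sum_k A_{mk} g_k$ with $A_{mk} = \sum_{n=k}^{N-1} H_{mn}$, and that $g$ is sparse.

```python
import random

random.seed(1)                                   # repeatable test data
N, M = 12, 4

# Locally constant signal with two jumps, so g has two nonzero entries.
f = [2.0] * 5 + [-1.0] * 4 + [3.0] * 3
g = [f[n] - f[n - 1] for n in range(1, N)]       # g_1, ..., g_{N-1}

# Known measurement coefficients H_mn and the resulting data d_m.
H = [[random.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(M)]
d = [sum(H[m][n] * f[n] for n in range(N)) for m in range(M)]

# A_mk = sum of H_mn over n = k, ..., N-1, for k = 1, ..., N-1.
A = [[sum(H[m][n] for n in range(k, N)) for k in range(1, N)]
     for m in range(M)]

# Remove the contribution of the known value f(0).
b = [d[m] - f[0] * sum(H[m]) for m in range(M)]
Ag = [sum(A[m][k] * g[k] for k in range(N - 1)) for m in range(M)]

nonzero_g = sum(1 for v in g if v != 0)
print(nonzero_g, max(abs(p - q) for p, q in zip(b, Ag)))
```

The system has only $M = 4$ equations in $N - 1 = 11$ unknowns, which is the under-determined setting in which a sparse solution is sought.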


22.5.3 Tomographic Imaging 


The reconstruction of tomographic images is an important aspect of 
medical diagnosis, and one that combines aspects of both of the previous 
examples. The data one obtains from the scanning process can often be 
interpreted as values of the Fourier transform of the desired image; this is 
precisely the case in magnetic-resonance imaging, and approximately true 
for x-ray transmission tomography, positron-emission tomography (PET) 
and single-photon emission tomography (SPECT). The images one encoun- 
ters in medical diagnosis are often approximately locally constant, so the 
associated array of discrete partial derivatives will be sparse. If this sparse 
derivative array can be recovered from relatively few Fourier-transform val- 
ues, then the scanning time can be reduced. 

We turn now to the more general problem of compressed sampling. 


22.6 Compressed Sampling 


Our goal is to recover the vector $f = (f_1, \ldots, f_J)^T$ from $I$ linear
functional values of $f$, where $I$ is much less than $J$. In general, this is
not possible without prior information about the vector $f$. In compressed
sampling, the prior information concerns the sparseness of either $f$ itself,
or another vector linearly related to $f$.

Let $U$ and $V$ be unitary J by J matrices, so that the column vectors
of both $U$ and $V$ form orthonormal bases for $\mathbb{C}^J$. We shall refer
to the bases associated with $U$ and $V$ as the sampling basis and the
representing basis, respectively. The first objective is to find a unitary
matrix $V$ so that $f = Vx$, where $x$ is sparse. Then we want to find a
second unitary matrix $U$ such that, when an I by J matrix $R$ is obtained
from $U$ by deleting rows, the sparse vector $x$ can be determined from the
data $b = RVx = Ax$. Theorems in compressed sensing describe properties of the
matrices $U$ and $V$ such that, when $R$ is obtained from $U$ by a random
selection of the rows of $U$, the vector $x$ will be uniquely determined, with
high probability, as the unique solution that minimizes the one-norm.
23.1 Chapter Summary


In this chapter we review a few important results from the theory of 
probability. 
23.2 Independent Random Variables


Let $X_1, \ldots, X_N$ be $N$ independent real random variables with the same
mean (that is, expected value) $\mu$ and same variance $\sigma^2$. The main
consequence of independence is that
$E(X_i X_j) = E(X_i) E(X_j) = \mu^2$ for $i \neq j$.
Then, it is easily shown that the sample average


$$\bar{X} = N^{-1} \sum_{n=1}^{N} X_n$$


has $\mu$ for its mean and $\sigma^2 / N$ for its variance.


Ex. 23.1 Prove these two assertions. 


23.3 Maximum Likelihood Parameter Estimation 


Suppose that the random variable $X$ has a probability density function
$p(x; \theta)$, where $\theta$ is an unknown parameter. A common problem in
statistics is to estimate $\theta$ from independently sampled values of $X$,
say $x_1, \ldots, x_N$. A frequently used approach is to maximize the function
of $\theta$ given by


$$L(\theta) = L(\theta; x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n; \theta).$$


The function $L(\theta)$ is the likelihood function and a value of $\theta$
maximizing $L(\theta)$ is a maximum likelihood estimate. We give two examples
of maximum likelihood (ML) estimation.


23.3.1 An Example: The Bias of a Coin 


Let $\theta$ in the interval $[0, 1]$ be the unknown probability of success on
one trial of a binomial distribution (a coin flip, for example), so that the
probability of $k$ successes in $N$ trials is
$L(\theta; k, N) = \frac{N!}{k!(N-k)!} \theta^k (1 - \theta)^{N-k}$, for
$k = 0, 1, \ldots, N$. If we have observed $N$ trials and have recorded $k$
successes, we can estimate $\theta$ by selecting that $\hat{\theta}$ for which
$L(\theta; k, N)$ is maximized as a function of $\theta$.


Ex. 23.2 Show that, for the binomial case described above, the maximum
likelihood estimate of $\theta$ is $\hat{\theta} = k/N$.
23.3.2 Estimating a Poisson Mean


A random variable $X$ taking on only nonnegative integer values is said to
have the Poisson distribution with parameter $\lambda > 0$ if, for each
nonnegative integer $k$, the probability $p_k$ that $X$ will take on the value
$k$ is given by


$$p_k = e^{-\lambda} \lambda^k / k!.$$

Ex. 23.3 Show that the sequence $\{p_k\}_{k=0}^{\infty}$ sums to one.


Ex. 23.4 Show that the expected value $E(X)$ is $\lambda$, where the expected
value in this case is

$$E(X) = \sum_{k=0}^{\infty} k \, p_k.$$


Ex. 23.5 Show that the variance of $X$ is also $\lambda$, where the variance
of $X$ in this case is


$$\mathrm{var}(X) = \sum_{k=0}^{\infty} (k - \lambda)^2 p_k.$$


Ex. 23.6 Show that the ML estimate of $\lambda$ based on $N$ independent
samples is the sample mean.


23.4 Independent Poisson Random Variables 


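Let $Z_1, \ldots, Z_N$ be independent Poisson random variables with expected
value $E(Z_n) = \lambda_n$. Let $Z$ be the random vector with $Z_n$ as its
entries, $\lambda$ the vector whose entries are the $\lambda_n$, and
$\lambda_+ = \sum_{n=1}^{N} \lambda_n$. Then the probability function for $Z$
is


$$f(Z|\lambda) = \prod_{n=1}^{N} \lambda_n^{z_n} \exp(-\lambda_n) / z_n!
= \exp(-\lambda_+) \prod_{n=1}^{N} \lambda_n^{z_n} / z_n!.$$

Now let $Y = \sum_{n=1}^{N} Z_n$. Then, the probability function for $Y$ is


$$\mathrm{Prob}(Y = y) = \mathrm{Prob}(Z_1 + \cdots + Z_N = y)
= \sum_{z_1 + \cdots + z_N = y} \exp(-\lambda_+) \prod_{n=1}^{N} \lambda_n^{z_n} / z_n!.$$


But, as we shall see shortly, $Y$ is a Poisson random variable with
$E(Y) = \lambda_+$, since we have


$$\sum_{z_1 + \cdots + z_N = y} \exp(-\lambda_+) \prod_{n=1}^{N} \lambda_n^{z_n} / z_n!
= \exp(-\lambda_+) \lambda_+^{y} / y!. \qquad (23.1)$$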
When we observe an instance of $y$, we can consider the conditional
distribution $f(Z|\lambda, y)$ of $\{Z_1, \ldots, Z_N\}$, subject to
$y = Z_1 + \cdots + Z_N$. We have

$$f(Z|\lambda, y) = \frac{y!}{z_1! \cdots z_N!}
\left( \frac{\lambda_1}{\lambda_+} \right)^{z_1} \cdots
\left( \frac{\lambda_N}{\lambda_+} \right)^{z_N}.$$


This is a multinomial distribution. Given $y$ and $\lambda$, the conditional
expected value of $Z_n$ is then $E(Z_n|\lambda, y) = y \lambda_n / \lambda_+$.
To see why Equation (23.1) is true, we discuss the multinomial distribution.


23.5 The Multinomial Distribution 


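When we expand the quantity $(a_1 + \cdots + a_N)^y$, we obtain a sum of
terms, each of the form $a_1^{z_1} \cdots a_N^{z_N}$, with
$z_1 + \cdots + z_N = y$. How many terms of the same form are there? There are
$N$ variables. We are to select $z_n$ of type $n$, for each
$n = 1, \ldots, N$, to get $y = z_1 + \cdots + z_N$ factors. Imagine $y$ blank
spaces, to be filled in by various factor types as we do the selection. We
select $z_1$ of these blanks and mark them $a_1$, for type one. We can do that
in $\binom{y}{z_1}$ ways. We then select $z_2$ of the remaining blank spaces
and enter $a_2$ in them; we can do this in $\binom{y - z_1}{z_2}$ ways.
Continuing in this way, we find that we can select the $N$ factor types in


$$\binom{y}{z_1} \binom{y - z_1}{z_2} \cdots \binom{y - (z_1 + \cdots + z_{N-2})}{z_{N-1}}$$


ways, or in


$$\frac{y!}{z_1! (y - z_1)!} \cdots
\frac{(y - (z_1 + \cdots + z_{N-2}))!}{z_{N-1}! (y - (z_1 + \cdots + z_{N-1}))!}
= \frac{y!}{z_1! \cdots z_N!}$$


ways. This tells us in how many different sequences the factor types can be
selected. Applying this, we get the multinomial theorem:


$$(a_1 + \cdots + a_N)^y = \sum_{z_1 + \cdots + z_N = y}
\frac{y!}{z_1! \cdots z_N!} a_1^{z_1} \cdots a_N^{z_N}.$$


Select $a_n = \lambda_n / \lambda_+$. Then,


$$1 = 1^y = \left( \frac{\lambda_1}{\lambda_+} + \cdots + \frac{\lambda_N}{\lambda_+} \right)^y
= \sum_{z_1 + \cdots + z_N = y} \frac{y!}{z_1! \cdots z_N!}
\left( \frac{\lambda_1}{\lambda_+} \right)^{z_1} \cdots
\left( \frac{\lambda_N}{\lambda_+} \right)^{z_N}.$$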
From this we get


$$\sum_{z_1 + \cdots + z_N = y} \exp(-\lambda_+) \prod_{n=1}^{N} \lambda_n^{z_n} / z_n!
= \exp(-\lambda_+) \lambda_+^{y} / y!.$$
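Equation (23.1) can be confirmed by direct enumeration for small cases. The sketch below sums the product of Poisson probabilities over all nonnegative integer vectors $(z_1, \ldots, z_N)$ with $z_1 + \cdots + z_N = y$ and compares the result with the Poisson probability for parameter $\lambda_+$; the particular means and the value of $y$ are arbitrary.

```python
import math
from itertools import product

def sum_over_compositions(lams, y):
    """Left side of (23.1): sum over all z with z_1 + ... + z_N = y."""
    lam_plus = sum(lams)
    total = 0.0
    for z in product(range(y + 1), repeat=len(lams)):
        if sum(z) != y:
            continue
        term = math.exp(-lam_plus)
        for lam, zn in zip(lams, z):
            term *= lam ** zn / math.factorial(zn)
        total += term
    return total

def poisson_pmf(lam, y):
    """Right side of (23.1): Poisson probability of the value y."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lams = [0.5, 1.2, 2.0]
y = 4
left = sum_over_compositions(lams, y)
right = poisson_pmf(sum(lams), y)
print(left, right)
```

The two values agree to rounding error, as the multinomial theorem guarantees.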


23.6 Characteristic Functions 


The Fourier transform shows up in probability theory in the guise of the 
characteristic function of a random variable. The characteristic function is 
related to, but more general than, the moment-generating function and 
serves much the same purposes. 

A real-valued random variable $X$ is said to have the probability density
function (pdf) $f(x)$ if, for any interval $[a, b]$, the probability that $X$
takes its value within this interval is given by the integral
$\int_a^b f(x) dx$. To be a pdf, $f(x)$ must be nonnegative and
$\int_{-\infty}^{\infty} f(x) dx = 1$. The characteristic function of $X$ is
then


$$F(\omega) = \int_{-\infty}^{\infty} f(x) e^{i x \omega} dx.$$


The formulas for differentiating the Fourier transform are quite useful in
determining the moments of a random variable.
The expected value of $X$ is

$$E(X) = \int_{-\infty}^{\infty} x f(x) dx,$$

and for any real-valued function $g(x)$ the expected value of the random
variable $g(X)$ is

$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x) dx.$$

The $n$th moment of $X$ is
$$E(X^n) = \int_{-\infty}^{\infty} x^n f(x) dx;$$


the variance of $X$ is then $\mathrm{var}(X) = E(X^2) - E(X)^2$. Since
differentiating $F(\omega)$ $n$ times brings down the factor $(ix)^n$ inside
the integral, it follows that the $n$th moment of the random variable $X$ is
given by


$$E(X^n) = (-i)^n F^{(n)}(0).$$
If we have $N$ real-valued random variables $X_1, \ldots, X_N$, their joint
probability density function is $f(x_1, \ldots, x_N) \ge 0$ having the
property that, for any intervals $[a_1, b_1], \ldots, [a_N, b_N]$, the
probability that $X_n$ takes its value within $[a_n, b_n]$, for each $n$, is
given by the multiple integral


$$\int_{a_1}^{b_1} \cdots \int_{a_N}^{b_N} f(x_1, \ldots, x_N) \, dx_1 \cdots dx_N.$$


The joint moments are then


$$E(X_1^{m_1} \cdots X_N^{m_N}) = \int \cdots \int x_1^{m_1} \cdots x_N^{m_N}
f(x_1, \ldots, x_N) \, dx_1 \cdots dx_N.$$


The joint moments can be calculated by evaluating at zero the partial
derivatives of the characteristic function of the joint pdf.
The random variables are said to be independent if


$$f(x_1, \ldots, x_N) = f(x_1) \cdots f(x_N),$$


where, in keeping with the convention used in the probability literature,
$f(x_n)$ denotes the pdf of the random variable $X_n$.

If $X$ and $Y$ are independent random variables with probability density
functions $f(x)$ and $g(y)$, then the probability density function for the
random variable $Z = X + Y$ is $(f * g)(z)$, the convolution of $f$ and $g$. To see
this, we first calculate the cumulative distribution function

$$H(z) = \mathrm{Prob}(X + Y \le z),$$

which is

$$H(z) = \int_{x=-\infty}^{+\infty}\int_{y=-\infty}^{z-x} f(x)g(y)\,dy\,dx.$$

Using the change of variable $t = x + y$, we get

$$H(z) = \int_{x=-\infty}^{+\infty}\int_{t=-\infty}^{z} f(x)g(t-x)\,dt\,dx.$$

The pdf for the random variable $Z$ is $h(z) = H'(z)$, the derivative of $H(z)$.
Differentiating the inner integral with respect to $z$, we obtain

$$h(z) = \int_{-\infty}^{+\infty} f(x)g(z-x)\,dx;$$

therefore, $h(z) = (f * g)(z)$. It follows that the characteristic function for the
random variable $Z = X + Y$ is the product of the characteristic functions
for $X$ and $Y$.
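
To make the convolution formula concrete, the following minimal sketch assumes that X and Y are both uniform on [0,1], a choice made here only for illustration. The convolution integral is approximated by a midpoint rule; the function names, integration limits and step count are ours. For this choice the density of Z = X + Y is the triangle function, equal to z on [0,1] and 2 - z on [1,2].

```python
def uniform_pdf(x):
    """Density of a random variable uniformly distributed on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(f, g, z, lo=-1.0, hi=3.0, steps=40000):
    """Midpoint-rule approximation of the integral of f(x) g(z - x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += f(x) * g(z - x)
    return total * dx

h_half = convolve_at(uniform_pdf, uniform_pdf, 0.5)           # triangle value 0.5
h_one = convolve_at(uniform_pdf, uniform_pdf, 1.0)            # triangle value 1.0
h_three_halves = convolve_at(uniform_pdf, uniform_pdf, 1.5)   # triangle value 0.5
print(h_half, h_one, h_three_halves)
```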


23.7 Gaussian Random Variables


A real-valued random variable $X$ is called Gaussian or normal with
mean $\mu$ and variance $\sigma^2$ if its probability density function (pdf) is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

In the statistical literature a normal random variable is standard if its mean
is $\mu = 0$ and its variance is $\sigma^2 = 1$.


23.7.1 Gaussian Random Vectors 


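Suppose now that $Z_1,\ldots,Z_N$ are independent standard normal random
variables. Then, their joint pdf is the function

$$f(z_1,\ldots,z_N) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}z_n^2\right) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{1}{2}(z_1^2+\cdots+z_N^2)\right).$$

By taking linear combinations of these random variables, we can obtain a
new set of normal random variables that are no longer independent. For
each $m = 1,\ldots,M$ let

$$X_m = \sum_{n=1}^N A_{mn}Z_n.$$

Then $E(X_m) = 0$.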
The covariance matrix associated with the $X_m$ is the matrix $R$ with
entries $R_{mn} = E(X_mX_n)$, $m,n = 1,2,\ldots,M$. We have

$$E(X_mX_n) = \sum_{k=1}^N\sum_{j=1}^N A_{mk}A_{nj}E(Z_kZ_j).$$

Since the $Z_n$ are independent with mean zero, we have $E(Z_kZ_j) = 0$ for
$k \ne j$ and $E(Z_k^2) = 1$. Therefore,

$$E(X_mX_n) = \sum_{k=1}^N A_{mk}A_{nk},$$

and the covariance matrix is $R = AA^T$.
Writing $X = (X_1,\ldots,X_M)^T$ and $Z = (Z_1,\ldots,Z_N)^T$, we have $X = AZ$,
where $A$ is the $M$ by $N$ matrix with entries $A_{mn}$. Using the standard
formulas for changing variables, we find that, when $M = N$ and $A$ is
invertible, the joint pdf for the random variables $X_1,\ldots,X_N$ is

$$f(x_1,\ldots,x_N) = \frac{1}{\sqrt{\det(R)}}\,\frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{1}{2}x^TR^{-1}x\right),$$

with $x = (x_1,\ldots,x_N)^T$. For the remainder of this chapter, we limit the
discussion to the case of $M = N = 2$ and use the notation $X_1 = X$,
$X_2 = Y$ and $f(x_1,x_2) = f(x,y)$. We write $\sigma_1^2$ and $\sigma_2^2$ for the variances
of $X$ and $Y$, and we let $\rho = E(XY)/\sigma_1\sigma_2$.
The two-dimensional FT of the function $f(x,y)$, the characteristic function
of the Gaussian random vector $X$, is

$$F(\alpha,\beta) = \exp\left(-\frac{1}{2}(\sigma_1^2\alpha^2 + \sigma_2^2\beta^2 + 2\sigma_1\sigma_2\rho\alpha\beta)\right).$$


Ex. 23.7 Use partial derivatives of $F(\alpha,\beta)$ to show that
$E(X^2Y^2) = \sigma_1^2\sigma_2^2 + 2\sigma_1^2\sigma_2^2\rho^2$.


Ex. 23.8 Show that $E(X^2Y^2) = E(X^2)E(Y^2) + 2E(XY)^2$.
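
The identity in these exercises can be explored by simulation before it is proved. The sketch below is a Monte Carlo check, not a derivation; the correlation coefficient, the standard deviations, the sample size and the seed are hypothetical values chosen for illustration. It builds a correlated zero-mean Gaussian pair from two independent standard normal variables, as in the construction X = AZ above, and compares the sample average of X²Y² with the right-hand side.

```python
import random

random.seed(0)                 # fixed seed so the run is repeatable
rho = 0.5                      # hypothetical correlation coefficient
sigma1, sigma2 = 1.0, 2.0      # hypothetical standard deviations of X and Y
n_samples = 200_000

sum_x2y2 = sum_x2 = sum_y2 = sum_xy = 0.0
for _ in range(n_samples):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = sigma1 * z1
    y = sigma2 * (rho * z1 + (1 - rho**2) ** 0.5 * z2)   # corr(X, Y) = rho
    sum_x2y2 += x * x * y * y
    sum_x2 += x * x
    sum_y2 += y * y
    sum_xy += x * y

lhs = sum_x2y2 / n_samples
rhs = (sum_x2 / n_samples) * (sum_y2 / n_samples) + 2 * (sum_xy / n_samples) ** 2
exact = sigma1**2 * sigma2**2 * (1 + 2 * rho**2)   # value predicted by the identity
print(lhs, rhs, exact)
```

The three printed numbers agree only up to sampling error, which shrinks like one over the square root of the number of samples.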


23.7.2 Complex Gaussian Random Variables 


Let $X$ and $Y$ be independent real Gaussian random variables with
means $\mu_x$ and $\mu_y$, respectively, and common variance $\sigma^2$. Then $W = X + iY$
is a complex Gaussian random variable with mean $\mu_w = E(W) = \mu_x + i\mu_y$
and variance $\sigma_w^2 = 2\sigma^2$.

The results of Exercise 23.7 extend to complex Gaussian random
variables $W$ and $V$. In the complex case we have

$$E(|V|^2|W|^2) = E(|V|^2)E(|W|^2) + |E(V\overline{W})|^2.$$

This is important in optical image processing, where it is called the
Hanbury Brown-Twiss effect and provides the basis for intensity
interferometry [78]. The main point is that we can obtain magnitude information
about $E(V\overline{W})$, but not phase information, by measuring the correlation
between the magnitudes of $V$ and $W$; that is, we learn something about
$E(V\overline{W})$ from intensity measurements. Since we have only the magnitude
of $E(V\overline{W})$, we then have a phase problem.


23.8 Using A Priori Information 


We know that to get information out we need to put information in; the
problem is how to do it. One approach that is quite popular within the
image-reconstruction community is the use of statistical Bayesian methods
and maximum a posteriori (MAP) estimation.


23.9 Conditional Probabilities and Bayes’ Rule 


Suppose that A and B are two events with positive probabilities P(A)
and P(B), respectively. The conditional probability of B, given A, is defined
to be P(B|A) = P(A ∩ B)/P(A). It follows that Bayes’ Rule holds:


P(A|B) = P(B|A)P(A)/P(B).


To illustrate the use of this rule, we consider the following example. 


23.9.1 An Example of Bayes’ Rule 


Suppose that, in a certain town, 10 percent of the adults over 50 have 
diabetes. The town doctor correctly diagnoses those with diabetes as having 
the disease 95 percent of the time. In two percent of the cases he incorrectly 
diagnoses those not having the disease as having it. Let D mean that the 
patient has diabetes, N that the patient does not have the disease, A that 
a diagnosis of diabetes is made, and B that a diagnosis of diabetes is not 
made. The probability that he will diagnose a given adult as having diabetes 
is given by the rule of total probability: 


P(A) = P(A|D)P(D) + P(A|N) P(N). 


In this example, we obtain P(A) = (0.95)(0.10) + (0.02)(0.90) = 0.113. Now
suppose a patient receives a diagnosis of diabetes. What is the probability
that this diagnosis is correct? In other words, what is P(D|A)? For this we
use Bayes’ Rule:


P(D|A) = P(A|D)P(D)/P(A),


which is 0.095/0.113, or about 0.84.
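
The arithmetic of this example is short enough to script. The following sketch uses only the numbers stated above; the variable names are ours.

```python
p_d = 0.10            # P(D): prevalence of diabetes among adults over 50
p_n = 1 - p_d         # P(N): probability of not having the disease
p_a_given_d = 0.95    # P(A|D): correct diagnosis of those with the disease
p_a_given_n = 0.02    # P(A|N): incorrect diagnosis of those without it

# Rule of total probability, then Bayes' Rule.
p_a = p_a_given_d * p_d + p_a_given_n * p_n
p_d_given_a = p_a_given_d * p_d / p_a

print(round(p_a, 3), round(p_d_given_a, 2))
```

Varying p_d in this script shows how strongly the answer depends on the prevalence: for a rarer disease, the same doctor's positive diagnoses are correct much less often.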


23.9.2 Using Prior Probabilities 


So far nothing is controversial. The fun begins when we attempt to 
broaden the use of Bayes’ Rule to ascribe a priori probabilities to quantities 
that are not random. The example used originally by Thomas Bayes in the 
eighteenth century is as follows. Imagine a billiard table with a line drawn 
across it parallel to its shorter side, cutting the table into two rectangular 
regions, the nearer called A and the farther B. Balls are tossed on to the
table, coming to rest in either of the two regions. Suppose that we are told
only that after N such tosses n of the balls ended up in region A. What is 
the probability that the next ball will end up in region A? 

At first it would seem that we cannot answer this question unless we
are told the probability of any ball ending up in region A; Bayes argues
differently, however. Let A be the event that a ball comes to rest in region
A, and let P(A) = x be the unknown probability of coming to rest in region
A; we may consider x to be the relative area of region A, although this is
not necessary. Let D be the event that n out of N balls end up in A. Then,

$$P(D|x) = \binom{N}{n}x^n(1-x)^{N-n}.$$

Bayes then adopts the view that the horizontal line on the table was
randomly positioned so that the unknown x can be treated as a random
variable. Using Bayes’ Rule, we have

$$P(x|D) = P(D|x)P(x)/P(D),$$

where P(x) is the probability density function (pdf) of the random variable
x, which Bayes takes to be uniform over the interval [0,1]. Therefore, we
have

$$P(x|D) = c\binom{N}{n}x^n(1-x)^{N-n},$$

where c is chosen so as to make P(x|D) a pdf.


Ex. 23.9 Use integration by parts or the Beta function to show that

$$\binom{N}{n}\int_0^1 x^n(1-x)^{N-n}\,dx = 1/(N+1)$$

and

$$\binom{N+1}{n+1}\int_0^1 x^{n+1}(1-x)^{N-n}\,dx = 1/(N+2)$$

for $n = 0,1,\ldots,N$.


From the exercise we can conclude that c = N + 1; therefore we have the
pdf P(x|D). Now we want to estimate x itself. One way to do this is to
calculate the expected value of this pdf, which, according to the exercise, is
(n + 1)/(N + 2). So even though we do not know x, we can reasonably say
(n + 1)/(N + 2) is the probability that the next ball will end up in region
A, given the behavior of the previous N balls.

There is a second way to estimate x; we can find the value of x for which 
the pdf reaches its maximum. A quick calculation shows this value to be 
n/N. This estimate of x is not the same as the one we calculated using the 
expected value but they are close for large N. 
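
The two estimates of x can be compared exactly for any particular n and N. The sketch below assumes the hypothetical values n = 7 and N = 10 and uses exact rational arithmetic; the helper names are ours. It relies on the standard closed form for the integral of x^a (1-x)^b over [0,1] with nonnegative integer exponents, namely a! b!/(a+b+1)!.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral over [0,1] of x**a * (1-x)**b for nonnegative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def posterior_summary(n, N):
    """Return (c, posterior mean, posterior mode) under the uniform prior on x."""
    norm = comb(N, n) * beta_integral(n, N - n)            # equals 1/(N+1)
    c = 1 / norm                                           # normalizing constant
    mean = c * comb(N, n) * beta_integral(n + 1, N - n)    # expected value of x given D
    mode = Fraction(n, N)                                  # maximizer of x**n (1-x)**(N-n)
    return c, mean, mode

c, mean, mode = posterior_summary(n=7, N=10)   # hypothetical counts
print(c, mean, mode)
```

For these counts the expected value is 8/12 and the maximizer is 7/10; the gap between the two narrows as N grows.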

What is controversial here is the decision to treat the positioning of the
line as a random act, whereby x becomes a random variable, as well as 
the selection of a specific pdf to govern the random variable x. Even if x 
were a random variable, we do not necessarily know its pdf. Bayes takes 
the pdf to be uniform over [0,1], more as an expression of ignorance than 
of knowledge. It is this broader use of prior probabilities that is generally 
known as Bayesian methods and not the use of Bayes’ Rule itself. 


23.10 Maximum A Posteriori Estimation 


Bayesian methods provide us with an alternative to maximum likelihood
parameter estimation. Suppose that a random variable (or vector) $Z$ has the
pdf $f(z;\theta)$, where $\theta$ is a parameter. When we hold $z$ fixed and view $f(z;\theta)$
as a function only of $\theta$, it is called the likelihood function. Having observed
an instance of $Z$, call it $z$, we can estimate the parameter $\theta$ by selecting
that value for which the likelihood function $f(z;\theta)$ has its maximum. This
is the maximum likelihood (ML) estimator. Alternatively, suppose that we
treat $\theta$ itself as one value of a random variable $\Theta$ having its own pdf, say
$g(\theta)$. Then, Bayes’ Rule says that the conditional pdf of $\Theta$, given $z$, is

$$g(\theta|z) = f(z;\theta)g(\theta)/f(z),$$

where

$$f(z) = \int f(z;\theta)g(\theta)\,d\theta.$$

The maximum a posteriori (MAP) estimate of $\theta$ is the one for which the
function $g(\theta|z)$ is maximized. Taking logs and ignoring terms that do not
involve $\theta$, we find that the MAP estimate of $\theta$ maximizes the function
$\log f(z;\theta) + \log g(\theta)$.

Because the ML estimate maximizes $\log f(z;\theta)$, the MAP estimate is
viewed as involving a penalty term $\log g(\theta)$ missing in the ML approach.
This penalty function is based on the prior pdf $g(\theta)$. We have flexibility
in selecting $g(\theta)$ and often choose $g(\theta)$ in a way that expresses our prior
knowledge of the parameter $\theta$.
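
The difference between the ML and MAP estimates can be seen in a small example. The sketch below is an illustration under our own assumptions, not a method taken from the text: the likelihood is binomial (n successes in N trials), the prior is a Beta(a, b) density with hypothetical parameters a = b = 3, and both objective functions are maximized by a crude grid search. For this prior the MAP estimate has the known closed form (n + a - 1)/(N + a + b - 2), which the grid search should reproduce to within the grid spacing.

```python
from math import log

def log_likelihood(theta, n, N):
    """log f(z; theta) for n successes in N trials, omitting the constant binomial factor."""
    return n * log(theta) + (N - n) * log(1 - theta)

def log_prior(theta, a, b):
    """log g(theta) for a Beta(a, b) prior, up to an additive constant."""
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta)

def argmax_on_grid(objective, steps=100_000):
    """Return the grid point in the open interval (0, 1) that maximizes the objective."""
    return max((i / steps for i in range(1, steps)), key=objective)

n, N = 7, 10          # hypothetical data
a, b = 3.0, 3.0       # hypothetical prior parameters; the prior peaks at 1/2

theta_ml = argmax_on_grid(lambda t: log_likelihood(t, n, N))
theta_map = argmax_on_grid(lambda t: log_likelihood(t, n, N) + log_prior(t, a, b))
print(theta_ml, theta_map)
```

The penalty term pulls the estimate from n/N toward the peak of the prior; with a = b = 1 the prior is uniform and the two estimates coincide.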


23.11 MAP Reconstruction of Images


In emission tomography the parameter $\theta$ is actually a vectorized image
that we wish to reconstruct and the observed data constitute $z$. Our prior
knowledge about $\theta$ may be that the true image is near some prior estimate,
say $p$, of the correct answer, in which case $g(\theta)$ is selected to peak at $p$
[105]. Frequently our prior knowledge of $\theta$ is that the image it represents is
nearly constant locally, except for edges. Then $g(\theta)$ is designed to weight
more heavily the locally constant images and less heavily the others [82,
85, 106, 89, 109].


23.12 Penalty-Function Methods 


The so-called penalty function that appears in the MAP approach comes
from a prior pdf for $\theta$. This suggests more general methods that involve a
penalty function term that does not necessarily emerge from Bayes’ Rule
[34]. Such methods are well-known in optimization. We are free to estimate
$\theta$ as the maximizer of a suitable objective function whether or not that
function is a posterior probability. Using penalty-function methods permits
us to avoid the controversies that accompany Bayesian methods.


23.13 Basic Notions 


The covariance between two complex-valued random variables $x$ and $y$
is

$$\mathrm{cov}_{xy} = E\left((x - E(x))\overline{(y - E(y))}\right),$$

and the correlation coefficient is

$$\rho_{xy} = \mathrm{cov}_{xy}\Big/\sqrt{E(|x - E(x)|^2)}\sqrt{E(|y - E(y)|^2)}.$$

The two random variables are said to be uncorrelated if and only if $\rho_{xy} = 0$.
The covariance matrix of a random vector $v$ is the matrix $Q$ whose entries
are the covariances of all the pairs of entries of $v$. The vector $v$ is said to be
uncorrelated if $Q$ is diagonal; otherwise, we call $v$ correlated. If the expected
value of each of the entries of $v$ is zero, we also have $Q = E(vv^\dagger)$. We saw
in our discussion of the BLUE that when the noise vector $v$ is correlated
we need to employ the covariance matrix to obtain the best linear unbiased
estimator.


23.14 Generating Correlated Noise Vectors 


We can obtain an $N$ by 1 correlated-noise random vector $v$ as follows.
Select a positive integer $K$, an arbitrary $N$ by $K$ matrix $C$, and $K$
independent standard normal random variables $z_1,\ldots,z_K$; that is, their means
are equal to zero and their variances are equal to one. Then let $z$ be the
random vector with entries $z_k$. Define $v = Cz$. Then, we have $E(v) = 0$
and $E(vv^\dagger) = CC^\dagger = Q$. In fact, for the Gaussian case, this is the only way
to obtain a correlated Gaussian random vector. The matrix $C$ producing
the covariance matrix $Q$ is not unique.
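
The construction v = Cz is easy to simulate. The sketch below assumes a hypothetical real 3 by 2 matrix C, so that the conjugate transpose is just the transpose; the sample size, seed and names are ours. It draws many instances of v and compares the sample average of the products of its entries with the entries of CC^T.

```python
import random

random.seed(1)                 # fixed seed so the run is repeatable
C = [[1.0, 0.0],
     [0.8, 0.6],
     [0.5, -0.5]]              # hypothetical N-by-K matrix with N = 3, K = 2
N, K = len(C), len(C[0])

def times_own_transpose(C):
    """Return Q = C C^T for a real matrix C given as a list of rows."""
    return [[sum(ci[k] * cj[k] for k in range(len(ci))) for cj in C] for ci in C]

def sample_v():
    """Draw one instance of v = C z, with z a vector of independent standard normals."""
    z = [random.gauss(0, 1) for _ in range(K)]
    return [sum(C[i][k] * z[k] for k in range(K)) for i in range(N)]

Q = times_own_transpose(C)
trials = 100_000
Q_hat = [[0.0] * N for _ in range(N)]
for _ in range(trials):
    v = sample_v()
    for i in range(N):
        for j in range(N):
            Q_hat[i][j] += v[i] * v[j] / trials   # running sample estimate of E(v v^T)

max_err = max(abs(Q[i][j] - Q_hat[i][j]) for i in range(N) for j in range(N))
print(Q, max_err)
```

The largest discrepancy printed at the end is sampling error and decreases as the number of trials increases.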


23.15 Covariance Matrices 


In order for $Q$ to be a covariance matrix, it is necessary and sufficient
that it be Hermitian and nonnegative-definite; that is, $Q^\dagger = Q$ and the
eigenvalues of $Q$ are nonnegative. Given any such $Q$, we can create an $N$
by 1 noise vector $v$ having $Q$ as its covariance matrix using the
eigenvalue/eigenvector decomposition of $Q$. Then, taking $U$ to be the matrix
whose columns are the orthonormal eigenvectors of $Q$ and $L$ the diagonal
matrix whose diagonal entries are $\lambda_n$, $n = 1,\ldots,N$, the eigenvalues of $Q$, we
have $Q = ULU^\dagger$. For convenience, we assume that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N > 0$.
Let $z$ be a random $N$ by 1 vector whose entries are independent, standard
normal random variables, and let $C = U\sqrt{L}U^\dagger$, the Hermitian square root
of $Q$. Then, $v = Cz$ has $Q$ for its covariance matrix.
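
For a real symmetric 2 by 2 matrix the Hermitian square root can be written down explicitly, which gives a small check of the claim that C times its conjugate transpose recovers Q. The sketch below assumes the hypothetical matrix Q with rows (2, 1) and (1, 2); the function names are ours. The eigenvectors are obtained from the rotation angle that diagonalizes a symmetric 2 by 2 matrix.

```python
from math import atan2, cos, sin, sqrt

def sym_eig_2x2(a, b, d):
    """Eigenvalues (largest first) and orthonormal eigenvectors of [[a, b], [b, d]]."""
    theta = 0.5 * atan2(2 * b, a - d)          # rotation angle that diagonalizes the matrix
    u1 = (cos(theta), sin(theta))
    u2 = (-sin(theta), cos(theta))
    mean = 0.5 * (a + d)
    rad = sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean + rad, mean - rad), (u1, u2)

def hermitian_sqrt_2x2(a, b, d):
    """C = U sqrt(L) U^T for a nonnegative-definite real symmetric 2-by-2 matrix."""
    (l1, l2), (u1, u2) = sym_eig_2x2(a, b, d)
    s1, s2 = sqrt(l1), sqrt(l2)
    # U sqrt(L) U^T written as a sum of rank-one terms s_n * u_n u_n^T.
    return [[s1 * u1[i] * u1[j] + s2 * u2[i] * u2[j] for j in range(2)] for i in range(2)]

Q = [[2.0, 1.0], [1.0, 2.0]]                   # hypothetical covariance matrix
C = hermitian_sqrt_2x2(Q[0][0], Q[0][1], Q[1][1])
# C is symmetric, so C C^T equals C times C.
CC = [[sum(C[i][k] * C[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(C, CC)
```

For larger matrices one would use a numerical eigenvalue routine instead of a closed form.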

If we write this $v$ as

$$v = (U\sqrt{L}U^\dagger)z = U(\sqrt{L}U^\dagger z) = Up,$$

then $p = \sqrt{L}U^\dagger z$ is uncorrelated; $E(pp^\dagger) = L$.


23.16 Principal Component Analysis


We can write the vector $v = Up$ as

$$v = \sum_{n=1}^N p_nu^n, \qquad (23.2)$$

so that the entries of $v$ are

$$v_m = \sum_{n=1}^N p_nU_{m,n},$$

where $u^n = (U_{1,n},\ldots,U_{N,n})^T$ is the eigenvector of $Q$ associated with
eigenvalue $\lambda_n$. Since the variance of $p_n$ is $\lambda_n$, Equation (23.2) decomposes the
vector $v$ into components of decreasing strength. The terms in the sum
corresponding to the smaller indices describe most of $v$; they are the principal
components of $v$. Each $p_n$ is a linear combination of the entries of $v$, and
principal component analysis consists of finding these uncorrelated linear
combinations that best describe the correlated entries of $v$. The
representation $v = Up$ expresses $v$ as a linear combination of orthonormal vectors
with uncorrelated coefficients. This is analogous to the Karhunen-Loève
expansion for stochastic processes [2].

Principal component analysis has as its goal the approximation of the
covariance matrix $Q = E(vv^\dagger)$ by nonnegative-definite matrices of lower
rank. A related area is factor analysis, which attempts to describe the $N$
by $N$ covariance matrix $Q$ as $Q = AA^\dagger + D$, where $A$ is some $N$ by $J$
matrix, for some $J < N$, and $D$ is diagonal. Factor analysis attempts to
account for the correlated components of $Q$ using the lower-rank matrix
$AA^\dagger$. Underlying this is a model for the random vector $v$:

$$v = Ax + w,$$

where both $x$ and $w$ are uncorrelated. The entries of the random vector $x$
are the common factors that affect each entry of $v$ while those of $w$ are
the special factors, each associated with a single entry of $v$. Factor analysis
plays an increasingly prominent role in signal and image processing [17] as
well as in the social sciences.

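In [151] Gil Strang points out that, from a linear algebra standpoint,
factor analysis raises some questions. As his example shows, the
representation of $Q$ as $Q = AA^\dagger + D$ is not unique. The matrix $Q$ does not uniquely
determine the size of the matrix $A$:

$$Q = \begin{bmatrix} 1 & .74 & .24 & .24\\ .74 & 1 & .24 & .24\\ .24 & .24 & 1 & .74\\ .24 & .24 & .74 & 1 \end{bmatrix} = \begin{bmatrix} .7 & .5\\ .7 & .5\\ .7 & -.5\\ .7 & -.5 \end{bmatrix}\begin{bmatrix} .7 & .7 & .7 & .7\\ .5 & .5 & -.5 & -.5 \end{bmatrix} + .26I$$

and

$$Q = \begin{bmatrix} .6 & \sqrt{.38} & 0\\ .6 & \sqrt{.38} & 0\\ .4 & 0 & \sqrt{.58}\\ .4 & 0 & \sqrt{.58} \end{bmatrix}\begin{bmatrix} .6 & .6 & .4 & .4\\ \sqrt{.38} & \sqrt{.38} & 0 & 0\\ 0 & 0 & \sqrt{.58} & \sqrt{.58} \end{bmatrix} + .26I.$$

It is also possible to represent $Q$ with different diagonal components $D$.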
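
The non-uniqueness in this example can be confirmed by direct multiplication. The sketch below uses our reading of the two factorizations, a 4 by 2 matrix with entries .7 and ±.5 and a 4 by 3 matrix with entries .6, .4, √.38 and √.58, each combined with the diagonal term .26I; the function and variable names are ours.

```python
from math import sqrt

def aat_plus_diag(A, d):
    """Return A A^T + d I for a real matrix A given as a list of rows."""
    n = len(A)
    return [[sum(x * y for x, y in zip(A[i], A[j])) + (d if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# First factorization: A is 4 by 2.
A1 = [[0.7, 0.5], [0.7, 0.5], [0.7, -0.5], [0.7, -0.5]]

# Second factorization: A is 4 by 3.
r, s = sqrt(0.38), sqrt(0.58)
A2 = [[0.6, r, 0.0], [0.6, r, 0.0], [0.4, 0.0, s], [0.4, 0.0, s]]

Q = [[1.0, 0.74, 0.24, 0.24],
     [0.74, 1.0, 0.24, 0.24],
     [0.24, 0.24, 1.0, 0.74],
     [0.24, 0.24, 0.74, 1.0]]

Q1 = aat_plus_diag(A1, 0.26)
Q2 = aat_plus_diag(A2, 0.26)
err1 = max(abs(Q1[i][j] - Q[i][j]) for i in range(4) for j in range(4))
err2 = max(abs(Q2[i][j] - Q[i][j]) for i in range(4) for j in range(4))
print(err1, err2)
```

Both discrepancies are at the level of rounding error, so the same Q is produced by factor matrices with two columns and with three columns.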


Using the Wave Equation


24.1 Chapter Summary


In this chapter we demonstrate how the problem of Fourier-transform 
estimation from sampled data arises in the processing of measurements 
obtained by sampling electromagnetic- or acoustic-field fluctuations, as in 
radar or sonar. We continue the discussion, begun in Chapter 9, of plane- 
wave solutions of the wave equation. To illustrate the use of non-plane-wave 
solutions we consider the problem of detecting a source of acoustic energy 
in a shallow-water environment. 


24.2 The Wave Equation 


In many areas of remote sensing, what we measure are the fluctuations
in time of an electromagnetic or acoustic field. Such fields are described
mathematically as solutions of certain partial differential equations, such
as the wave equation. A function $u(t,x,y,z)$ is said to satisfy the
three-dimensional wave equation if

$$u_{tt} = c^2(u_{xx} + u_{yy} + u_{zz}) = c^2\nabla^2u,$$

where $u_{tt}$ denotes the second partial derivative of $u$ with respect to the time
variable $t$ and $c > 0$ is the (constant) speed of propagation. More
complicated versions of the wave equation permit the speed of propagation
$c$ to vary with the spatial variables $x, y, z$, but we shall not consider that
here.

Using the method of separation of variables, we start with solutions
$u(t,x,y,z)$ having the simple form

$$u(t,x,y,z) = f(t)g(x,y,z).$$

Inserting this separated form into the wave equation, we get

$$f''(t)g(x,y,z) = c^2f(t)\nabla^2g(x,y,z)$$

or

$$f''(t)/f(t) = c^2\nabla^2g(x,y,z)/g(x,y,z).$$

The function on the left is independent of the spatial variables, while the
one on the right is independent of the time variable; consequently, they
must both equal the same constant, which we denote $-\omega^2$. From this we
have two separate equations,

$$f''(t) + \omega^2f(t) = 0, \qquad (24.1)$$

and

$$\nabla^2g(x,y,z) + \frac{\omega^2}{c^2}g(x,y,z) = 0. \qquad (24.2)$$

Equation (24.2) is the Helmholtz equation.

Equation (24.1) has for its solutions the functions $f(t) = \cos(\omega t)$ and
$\sin(\omega t)$, or, in complex form, the complex exponential functions $f(t) = e^{i\omega t}$
and $f(t) = e^{-i\omega t}$. Functions $u(t,x,y,z) = f(t)g(x,y,z)$ with such time
dependence are called time-harmonic solutions.
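
A time-harmonic solution can be tested against the wave equation numerically. The sketch below assumes a hypothetical standing wave, the product of cos(ωt) with sines in x, y and z, where ω is chosen so that the spatial factor satisfies the Helmholtz equation; the speed, spatial frequencies, evaluation point and step size are ours. Second derivatives are approximated by central differences, so the residual is small but not exactly zero.

```python
from math import cos, sin, sqrt

c = 2.0                        # hypothetical propagation speed
k = (1.0, 2.0, 2.0)            # hypothetical spatial frequencies, with |k| = 3
omega = c * sqrt(sum(ki * ki for ki in k))   # makes g satisfy the Helmholtz equation

def u(t, x, y, z):
    """Separated solution f(t) g(x, y, z) with f(t) = cos(omega t)."""
    return cos(omega * t) * sin(k[0] * x) * sin(k[1] * y) * sin(k[2] * z)

def second_difference(fn, args, index, h=1e-3):
    """Central-difference estimate of the second partial derivative in one argument."""
    up = list(args)
    up[index] += h
    down = list(args)
    down[index] -= h
    return (fn(*up) - 2 * fn(*args) + fn(*down)) / h**2

point = (0.3, 0.4, 0.5, 0.6)   # hypothetical (t, x, y, z)
u_tt = second_difference(u, point, 0)
laplacian = sum(second_difference(u, point, i) for i in (1, 2, 3))
residual = u_tt - c**2 * laplacian
print(u_tt, c**2 * laplacian, residual)
```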

In three-dimensional spherical coordinates with $r = \sqrt{x^2 + y^2 + z^2}$ a
radial function $u(r,t)$ satisfies the wave equation if

$$u_{tt} = c^2\left(u_{rr} + \frac{2}{r}u_r\right).$$

Radial solutions to the wave equation have the property that at any fixed
time the value of $u$ is the same for all the points on a sphere centered at the
origin; the surfaces of constant value of $u$ are these spheres, for each fixed
time.

Suppose that at time $t = 0$ the function $u(r,0)$ is zero except for $r$
near zero; that is, initially, there is a localized disturbance centered at the
origin. As time passes that disturbance spreads out spherically. When the
radius of a sphere is very large, the surface of the sphere appears planar,
to an observer on that surface, who is said then to be in the far field. This
motivates the study of solutions of the wave equation that are constant on
planes, the so-called plane-wave solutions.

We simplify the situation by assuming that all the plane-wave solutions
are associated with the same frequency, $\omega$. In the continuous superposition
model, the field is a superposition of plane waves:

$$u(\mathbf{s},t) = e^{i\omega t}\int f(\mathbf{k})e^{i\mathbf{k}\cdot\mathbf{s}}\,d\mathbf{k}.$$

Our measurements at the sensor locations $\mathbf{s}_m$ give us the values

$$F(\mathbf{s}_m) = \int f(\mathbf{k})e^{i\mathbf{k}\cdot\mathbf{s}_m}\,d\mathbf{k},$$

for $m = 1,\ldots,M$. The data are then Fourier transform values of the complex
function $f(\mathbf{k})$; $f(\mathbf{k})$ is defined for all three-dimensional real vectors $\mathbf{k}$, but
is zero, in theory, at least, for those $\mathbf{k}$ whose squared length $\|\mathbf{k}\|^2$ is not
equal to $\omega^2/c^2$. Our goal is then to estimate $f(\mathbf{k})$ from finitely many values
of its Fourier transform. Since each $\mathbf{k}$ is a normal vector for its plane-wave
field component, determining the value of $f(\mathbf{k})$ will tell us the strength of
the plane-wave component coming from the direction $\mathbf{k}$.

The collection of sensors at the spatial locations $\mathbf{s}_m$, $m = 1,\ldots,M$,
is called an array, and the size of the array, in units of the wavelength
$\lambda = 2\pi c/\omega$, is called the aperture of the array. Generally, the larger the
aperture the better, but what is a large aperture for one value of $\omega$ will
be a smaller aperture for a lower frequency. The book by Haykin [88] is a
useful reference, as is the review paper by Wright, Pridham, and Kay [164].

In some applications the sensor locations are essentially arbitrary, while 
in others their locations are carefully chosen. Sometimes, the sensors are 
collinear, as in sonar towed arrays. Let’s look more closely at the collinear 
case. 

We assume now that the sensors are equi-spaced along the $x$-axis, at
locations $(m\Delta,0,0)$, $m = 1,\ldots,M$, where $\Delta > 0$ is the sensor spacing; such
an arrangement is called a uniform line array. This setup was illustrated
in Figure 9.1 in Chapter 9. Our data is then

$$F_m = F(\mathbf{s}_m) = F((m\Delta,0,0)) = \int f(\mathbf{k})e^{im\Delta\,\mathbf{k}\cdot(1,0,0)}\,d\mathbf{k}.$$

Since $\mathbf{k}\cdot(1,0,0) = \frac{\omega}{c}\cos\theta$, for $\theta$ the angle between the vector $\mathbf{k}$ and the
$x$-axis, we see that there is some ambiguity now; we cannot distinguish the
cone of vectors that have the same $\theta$. It is common then to assume that the
wavevectors $\mathbf{k}$ have no $z$-component and that $\theta$ is the angle between two
vectors in the $x,y$-plane, the so-called angle of arrival. The wavenumber
variable $k = \frac{\omega}{c}\cos\theta$ lies in the interval $[-\frac{\omega}{c},\frac{\omega}{c}]$, and we imagine that $f(\mathbf{k})$
is now $f(k)$, defined for $|k| \le \frac{\omega}{c}$. The Fourier transform of $f(k)$ is $F(s)$, a
function of a single real variable $s$. Our data is then viewed as the values
$F(m\Delta)$, for $m = 1,\ldots,M$. Since the function $f(k)$ is zero for $|k| > \frac{\omega}{c}$, the
Nyquist spacing in $s$ is $\frac{\pi c}{\omega}$, which is $\frac{\lambda}{2}$, where $\lambda = \frac{2\pi c}{\omega}$ is the wavelength.

To avoid aliasing, which now means mistaking one direction of arrival
for another, we need to select $\Delta \le \frac{\lambda}{2}$. When we have oversampled, so that
$\Delta < \frac{\lambda}{2}$, the interval $[-\frac{\omega}{c},\frac{\omega}{c}]$, the so-called visible region, is strictly smaller
than the interval $[-\frac{\pi}{\Delta},\frac{\pi}{\Delta}]$. If the model of propagation is accurate, all the
signal component plane waves will correspond to wavenumbers $k$ in the
visible region and the background noise will also appear as a superposition
of such propagating plane waves. In practice, there can be components in
the noise that appear to come from wavenumbers $k$ outside of the visible
region; this means these components of the noise are not due to distant
sources propagating as plane waves, but, perhaps, to sources that are in
the near field, or localized around individual sensors, or coming from the
electronics within the sensors.
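
Aliasing in the uniform line array can be demonstrated with two single plane waves. The sketch below is an idealized illustration under our own assumptions: each field consists of one plane wave of unit strength, so that the data are the samples exp(i m Δ k) with k = (ω/c) cos θ, and the angles of arrival 60 and 120 degrees are hypothetical. With a spacing of one full wavelength the two directions produce the same data; at the Nyquist spacing of half a wavelength they do not.

```python
import cmath
from math import cos, pi, radians

def array_data(theta_deg, spacing_in_wavelengths, M=8):
    """Samples exp(i m Delta k), m = 1..M, for one plane wave with k = (omega/c) cos(theta)."""
    # With Delta measured in wavelengths, Delta * omega / c = 2 pi * spacing_in_wavelengths.
    phase_step = 2 * pi * spacing_in_wavelengths * cos(radians(theta_deg))
    return [cmath.exp(1j * m * phase_step) for m in range(1, M + 1)]

def max_difference(a, b):
    """Largest magnitude of the difference between corresponding samples."""
    return max(abs(x - y) for x, y in zip(a, b))

# Spacing of one wavelength: the two directions are indistinguishable.
aliased = max_difference(array_data(60, 1.0), array_data(120, 1.0))
# Spacing of half a wavelength: the data differ.
resolved = max_difference(array_data(60, 0.5), array_data(120, 0.5))
print(aliased, resolved)
```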

Using the relation $\lambda\omega = 2\pi c$, we can calculate the Nyquist spacing
for any particular case of plane-wave array processing. For electromagnetic
waves the propagation speed is the speed of light, which we shall take here
to be $c = 3 \times 10^8$ meters per second. The wavelength $\lambda$ for gamma rays
is around one Angstrom, which is $10^{-10}$ meters; for x-rays it is about one
millimicron, or $10^{-9}$ meters. The visible spectrum has wavelengths that
are a little less than one micron, that is, $10^{-6}$ meters. Microwaves have
wavelengths as short as about one millimeter; broadcast radio has a $\lambda$ running
from about 10 meters to 1000 meters, while the so-called long radio waves can
have wavelengths several thousand meters long. At the one extreme it is
impractical (if not physically impossible) to place individual sensors at the
Nyquist spacing of fractions of microns, while at the other end, managing
to place the sensors far enough apart is the challenge.

In ocean acoustics it is usually assumed that the speed of propagation
of sound is around 1500 meters per second, although deviations from this
ambient sound speed are significant and, since they are caused by such things
as temperature differences in the ocean, can be used to estimate these
differences. At frequencies around 50 Hz, we find sound generated
by man-made machinery, such as motors in vessels, with higher frequency
harmonics sometimes present also; at other frequencies the main sources of
acoustic energy may be wind-driven waves or whales. The wavelength for
50 Hz is $\lambda = 30$ meters; sonar will typically operate both above and below
this wavelength. It is sometimes the case that the array of sensors is fixed
in place, so what may be Nyquist spacing for 50 Hz will be oversampling
for 20 Hz.
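
The numbers in this paragraph follow from the relation between wavelength, speed and frequency. The sketch below assumes that the frequencies quoted in Hz are ordinary frequencies, related to ω by ω = 2π times the frequency, so that the wavelength 2πc/ω is the speed divided by the frequency; the function names are ours.

```python
def wavelength(speed, freq_hz):
    """Wavelength 2*pi*c/omega, with omega = 2*pi*freq_hz, in meters."""
    return speed / freq_hz

def nyquist_spacing(speed, freq_hz):
    """Largest sensor spacing that avoids aliasing: half a wavelength."""
    return wavelength(speed, freq_hz) / 2

c_sound = 1500.0                           # nominal sound speed in the ocean, m/s
lam_50 = wavelength(c_sound, 50.0)         # 30 meters
d_50 = nyquist_spacing(c_sound, 50.0)      # 15 meters
d_20 = nyquist_spacing(c_sound, 20.0)      # 37.5 meters
print(lam_50, d_50, d_20)
```

An array built with 15-meter spacing is therefore at the Nyquist spacing for 50 Hz and oversamples at 20 Hz, where spacings up to 37.5 meters would suffice.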

It is often the case that we are primarily interested in the values $|f(k)|$,
not the complex values $f(k)$. Since the Fourier transform of the function
$|f(k)|^2$ is the autocorrelation function obtained by convolving the function
$F$ with $F$, we can mimic the approach used earlier for power spectrum
estimation to find $|f(k)|$. We can now employ the nonlinear methods such
as Burg’s MEM and Capon’s maximum-likelihood method.

In array processing, as in other forms of signal and image processing, we 
want to remove the noise and enhance the information-bearing component, 
the signal. To do this we need some idea of the statistical behavior of 
the noise, we need a physically accurate description of what the signals 
probably look like, and we need a way to use this information. Much of our 
discussion up to now has been about the many ways in which such prior 
information can be incorporated in linear and nonlinear procedures. We 
have not said much about the important issue of the sensitivity of these 
methods to mismatch; that is, what happens when our physical model is 
wrong or the statistics of the noise are not what we thought they were? We
did note earlier how Burg’s MEM resolves closely spaced sinusoids when 
the background is white noise, but when the noise is correlated, MEM can 
degrade rapidly. 

Even when the physical model and noise statistics are reasonably ac- 
curate, slight errors in the hardware can cause rapid degradation of the 
processor. Sometimes acoustic signal processing is performed with sensors 
that are designed to be expendable and are therefore less expensive and 
more prone to errors than more permanent equipment. Knowing what a 
sensor has received is important, but so is knowing when it received it. 
Slight phase errors caused by the hardware can go unnoticed when the 
data is processed in one manner, but can ruin the performance of another 
method. 

The information we seek is often stored redundantly in the data and 
hardware errors may harm only some of these storage locations, making 
robust processing still possible. As we saw in our discussion of eigenvector
methods, information about the frequencies of the complex exponential
components of the signal is stored in the roots of the polynomials
obtained from some of the eigenvectors. In [28] it was demonstrated that, in
the presence of correlated noise background, phase errors distort the roots 
of some of these polynomials more than others; robust estimation of the 
frequencies is still possible if the stable roots are interrogated. 

We have focused here exclusively on plane-wave propagation, which re- 
sults when the source is far enough away from the sensors and the speed of 
propagation is constant. In many important applications these conditions 
are violated, and different versions of the wave equation are needed, which 
have different solutions. For example, sonar signal processing in environ- 
ments such as shallow channels, in which some of the sound reaches the 
sensors only after interacting with the ocean floor or the surface, requires
more complicated parameterized models for solutions of the appropriate
wave equation. Lack of information about the depth and nature of the 
bottom can also cause errors in the signal processing. In some cases it is 
possible to use acoustic energy from known sources to determine the needed 
information. 

Array signal processing can be done in passive or active mode. In passive 
mode the energy is either reflected off of or originates at the object of 
interest: the moon reflects sunlight, while ships generate their own noise. 
In the active mode the object of interest does not generate or reflect enough 
energy by itself, so the energy is generated by the party doing the sensing: 
active sonar is sometimes used to locate quiet vessels, while radar is used 
to locate planes in the sky or to map the surface of the earth. Near-earth 
asteroids are initially detected by passive optical observation, as small dots 
of reflected sunlight; once detected, they are then imaged by active radar 
to determine their size, shape, rotation and such. 

Previously we considered the array processing problem in the context 
of plane-wave propagation. When the environment is more complicated, 
the wave equation must be modified to reflect the physics of the situation 
and the signal processing modified to incorporate that physics. A good 
example of such modification is provided by acoustic signal processing in 
shallow water, the subject of the rest of this chapter. 


24.3 The Shallow-Water Case 


In the shallow-water situation the acoustic energy from the source
interacts with the surface and with the bottom of the channel, prior to being
received by the sensors. The nature of this interaction is described by the
wave equation in cylindrical coordinates. The deviation from the ambient
pressure is the function $p(t,\mathbf{s}) = p(t,r,z,\theta)$, where $\mathbf{s} = (r,z,\theta)$ is the
spatial vector variable, $r$ is the range, $z$ the depth, and $\theta$ the bearing angle in
the horizontal. We assume a single frequency, $\omega$, so that

$$p(t,\mathbf{s}) = e^{i\omega t}g(r,z,\theta).$$


We shall assume cylindrical symmetry to remove the $\theta$ dependence; in many
applications the bearing is essentially known or limited by the environment
or can be determined by other means. The sensors are usually positioned
in a vertical array in the channel, with the top of the array taken to be
the origin of the coordinate system and positive $z$ taken to mean positive
depth below the surface. We shall also assume that there is a single source
of acoustic energy located at range $r_s$ and depth $z_s$.

To simplify a bit, we assume here that the sound speed $c = c(z)$ does not
change with range, but only with depth, and that the channel has constant
depth and density. Then, the Helmholtz equation for the function $g(r,z)$ is

$$\nabla^2g(r,z) + [\omega/c(z)]^2g(r,z) = 0.$$


The Laplacian is

$$\nabla^2g(r,z) = g_{rr}(r,z) + \frac{1}{r}g_r(r,z) + g_{zz}(r,z).$$

We separate the variables once again, writing

$$g(r,z) = f(r)u(z).$$

Then, the range function $f(r)$ must satisfy the differential equation

$$f''(r) + \frac{1}{r}f'(r) = -\alpha f(r),$$

and the depth function $u(z)$ satisfies the differential equation

$$u''(z) + k(z)^2u(z) = \alpha u(z),$$

where $\alpha$ is a separation constant and

$$k(z)^2 = [\omega/c(z)]^2.$$

Taking $\lambda^2 = \alpha$, the range equation becomes

$$f''(r) + \frac{1}{r}f'(r) + \lambda^2f(r) = 0,$$

which is Bessel’s equation, with Hankel-function solutions. The depth
equation becomes

$$u''(z) + (k(z)^2 - \lambda^2)u(z) = 0,$$

which is of Sturm-Liouville type. The boundary conditions pertaining to
the surface and the channel bottom will determine the values of $\lambda$ for which
a solution exists.

To illustrate the way in which the boundary conditions become involved, 
we consider two examples. 


24.4 The Homogeneous-Layer Model 


We assume now that the channel consists of a single homogeneous layer
of water of constant density, constant depth $d$, and constant sound speed
$c$. We impose the following boundary conditions:

1. Pressure-release surface: $u(0) = 0$;
2. Rigid bottom: $u'(d) = 0$.

With $\gamma^2 = (k^2 - \lambda^2)$, the solutions of the depth equation that vanish at
$z = 0$ are multiples of $\sin(\gamma z)$, and the condition at the bottom gives
$\cos(\gamma d) = 0$, so the permissible values of $\lambda$ are

$$\lambda_m = (k^2 - [(2m-1)\pi/2d]^2)^{1/2}, \quad m = 1,2,\ldots.$$

The normalized solutions of the depth equation are now

$$u_m(z) = \sqrt{2/d}\,\sin(\gamma_mz),$$

where

$$\gamma_m = \sqrt{k^2 - \lambda_m^2} = (2m-1)\pi/2d, \quad m = 1,2,\ldots.$$


For each $m$ the corresponding function of the range satisfies the differential
equation

$$f''(r) + \frac{1}{r}f'(r) + \lambda_m^2f(r) = 0,$$

which has solution $H_0^{(1)}(\lambda_mr)$, where $H_0^{(1)}$ is the zeroth order
Hankel-function solution of Bessel’s equation. The asymptotic form for this function
is

$$\pi iH_0^{(1)}(\lambda_mr) = \sqrt{\frac{2\pi}{\lambda_mr}}\exp\left(-i\left(\lambda_mr + \frac{\pi}{4}\right)\right).$$

It is this asymptotic form that is used in practice. Note that when $\lambda_m$ is
complex with a negative imaginary part, there will be a decaying
exponential in this solution, so this term will be omitted in the signal processing.

Having found the range and depth functions, we write g(r, z) as a su- 
perposition of these elementary products, called the modes: 


M 
g(r,z) = D AmH§? (Amr)um (2), 
m=1 


where M is the number of propagating modes free of decaying exponentials. 
The Am can be found from the original Helmholtz equation; they are 


Am = (1/4)Um(Zs), 


where zs is the depth of the source of the acoustic energy. Notice that 
the depth of the source also determines the strength of each mode in this 
superposition; this is described by saying that the source has excited certain 
modes and not others. 

The eigenvalues Am of the depth equation will be complex when 


Ds, (2m — 1)r 
c 2d 
If $\omega$ is below the cut-off frequency $\frac{\pi c}{2d}$, then all the 
$\lambda_m$ are complex and there are no propagating modes (M = 0). The 
number of propagating modes is 

$$M = \frac{1}{2} + \frac{\omega d}{\pi c},$$

rounded down to an integer; this expression is 1/2 plus the depth of the 
channel in units of half-wavelengths. 

This model for shallow-water propagation is helpful in revealing a number 
of the important aspects of modal propagation, but is of limited practical 
utility. A more useful and realistic model is the Pekeris waveguide. 
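
As a small numerical illustration of the homogeneous-layer model, the sketch below lists the propagating modes for one set of channel parameters. The numbers (a 100 Hz source, a 100 m channel, a sound speed of 1500 m/s) and the function names are assumptions chosen only for the example; they do not come from the text.

```python
import math

def propagating_modes(freq_hz, depth_m, sound_speed):
    """Return (gamma_m, lambda_m) pairs for the propagating modes of a
    homogeneous layer with pressure-release surface and rigid bottom."""
    omega = 2.0 * math.pi * freq_hz
    k = omega / sound_speed                       # k = omega / c
    modes = []
    m = 1
    while True:
        gamma = (2 * m - 1) * math.pi / (2.0 * depth_m)
        if gamma >= k:                            # lambda_m would no longer be real and positive
            break
        modes.append((gamma, math.sqrt(k * k - gamma * gamma)))
        m += 1
    return modes

def mode_shape(gamma, depth_m, z):
    """Normalized depth function u_m(z) = sqrt(2/d) sin(gamma_m z)."""
    return math.sqrt(2.0 / depth_m) * math.sin(gamma * z)

# Illustrative (assumed) parameters.
modes = propagating_modes(freq_hz=100.0, depth_m=100.0, sound_speed=1500.0)
print("number of propagating modes:", len(modes))
for gamma, lam in modes[:3]:
    print(gamma, lam)
```

Counting the modes directly in this way agrees with the integer part of $\frac{1}{2} + \frac{\omega d}{\pi c}$, and lowering the frequency below the cut-off returns an empty list.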


24.5 The Pekeris Waveguide 


Now we assume that the water column has constant depth d, sound 
speed c, and density b. Beneath the water is an infinite half-space with 
sound speed c’ > c, and density b’. Figure 24.1 illustrates the situation. 

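Using the new depth variable $v = \omega z/c$, the depth equation becomes 

$$u''(v) + \lambda^2 u(v) = 0, \quad \text{for } 0 \le v \le \frac{\omega d}{c},$$

and 

$$u''(v) + \left( \left( \frac{c}{c'} \right)^2 - 1 + \lambda^2 \right) u(v) = 0, \quad \text{for } \frac{\omega d}{c} < v.$$

To have a solution, $\lambda$ must satisfy the equation 

$$\tan(\lambda \omega d/c) = -(\lambda b/b') \Big/ \sqrt{1 - \left( \frac{c}{c'} \right)^2 - \lambda^2},$$

with 

$$1 - \left( \frac{c}{c'} \right)^2 - \lambda^2 \ge 0.$$

The trapped modes are those whose corresponding $\lambda$ satisfies 

$$1 \ge 1 - \lambda^2 \ge \left( \frac{c}{c'} \right)^2.$$

The eigenfunctions are 

$$u_m(v) = \sin(\lambda_m v), \quad \text{for } 0 \le v \le \frac{\omega d}{c},$$

and 

$$u_m(v) = \exp\left( -v \sqrt{1 - \left( \frac{c}{c'} \right)^2 - \lambda_m^2} \right), \quad \text{for } \frac{\omega d}{c} < v.$$

Although the Pekeris model has its uses, it still may not be realistic enough 
in some cases and more complicated propagation models will be needed. 

FIGURE 24.1: The Pekeris Model. 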
24.6 The General Normal-Mode Model 


Regardless of the model by which the modal functions are determined, 
the general normal-mode expansion for the $\theta$-independent case is 

$$g(r, z) = \sum_{m=1}^{M} u_m(z)\, s_m(r, z_s),$$

where M is the number of propagating modes and $s_m(r, z_s)$ is the modal 
amplitude containing all the information about the source of the sound. 


24.6.1 Matched-Field Processing 


In plane-wave array processing we write the acoustic field as a superposi- 
tion of plane-wave fields and try to find the corresponding amplitudes. This 
can be done using a matched filter, although high-resolution methods can 
also be used. In the matched-filter approach, we fix a wavevector and then 
match the data with the vector that describes what we would have received 
at the sensors had there been but a single plane wave present correspond- 
ing to that fixed wavevector; we then repeat for other fixed wavevectors. 
In more complicated acoustic environments, such as normal-mode propa- 
gation in shallow water, we write the acoustic field as a superposition of 
fields due to sources of acoustic energy at individual points in range and 
depth and then seek the corresponding amplitudes. Once again, this can 
be done using a matched filter. 

In matched-field processing we fix a particular range and depth and 
compute what we would have received at the sensors had the acoustic field 
been generated solely by a single source at that location. We then match the 
data with this computed vector. We repeat this process for many different 
choices of range and depth, obtaining a function of r and z showing the 
likely locations of actual sources. As in the plane-wave case, high-resolution 
nonlinear methods can also be used. 
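
The matched-filter search just described can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the homogeneous-layer modes of Section 24.4 with the asymptotic Hankel form quoted there, a hypothetical vertical array of 19 sensors, noise-free synthetic data, and made-up channel parameters. The names `replica` and `matched_field` are ours, not the text's.

```python
import cmath
import math

C, D, FREQ = 1500.0, 100.0, 100.0      # assumed sound speed (m/s), depth (m), frequency (Hz)
K = 2.0 * math.pi * FREQ / C

# Propagating modes of the homogeneous layer: gamma_m < k, so lambda_m is real.
GAMMAS = []
m = 1
while (2 * m - 1) * math.pi / (2.0 * D) < K:
    GAMMAS.append((2 * m - 1) * math.pi / (2.0 * D))
    m += 1
LAMBDAS = [math.sqrt(K * K - g * g) for g in GAMMAS]

def u(gamma, z):
    """Normalized depth function sqrt(2/d) sin(gamma z)."""
    return math.sqrt(2.0 / D) * math.sin(gamma * z)

def hankel_asymptotic(x):
    """Asymptotic form quoted in the text: pi*i*H(x) = sqrt(2 pi / x) exp(-i (x + pi/4))."""
    return math.sqrt(2.0 * math.pi / x) * cmath.exp(-1j * (x + math.pi / 4.0)) / (math.pi * 1j)

def replica(r, zs, sensor_depths):
    """Modeled field at each sensor for a single source at range r and depth zs."""
    return [sum(0.25 * u(g, zs) * hankel_asymptotic(lam * r) * u(g, z)
                for g, lam in zip(GAMMAS, LAMBDAS))
            for z in sensor_depths]

def matched_field(data, r, zs, sensor_depths):
    """Normalized matched-filter output, a number between 0 and 1."""
    rep = replica(r, zs, sensor_depths)
    inner = sum(x * y.conjugate() for x, y in zip(data, rep))
    norm = sum(abs(x) ** 2 for x in data) * sum(abs(y) ** 2 for y in rep)
    return abs(inner) ** 2 / norm

sensors = [5.0 * j for j in range(1, 20)]            # hypothetical array, 5 m to 95 m
true_r, true_z = 3000.0, 40.0
data = replica(true_r, true_z, sensors)              # noise-free synthetic data

ranges = [1000.0 + 250.0 * i for i in range(17)]     # search grid: 1 km to 5 km
depths = [10.0 * j for j in range(1, 10)]            # search grid: 10 m to 90 m
surface = {(r, z): matched_field(data, r, z, sensors) for r in ranges for z in depths}
best = max(surface, key=surface.get)
print("peak of the matched-field surface:", best, surface[best])
```

By the Cauchy-Schwarz inequality the normalized output never exceeds 1, and with noise-free data it reaches 1 at the true source location. With real data, mismatch between the assumed and actual environment lowers and shifts the peak.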

As in the plane-wave case, the performance of our processing methods 
can be degraded by incorrect description of the environment, as well as by 
phase errors and the like introduced by the hardware [28]. Once again, it is 
necessary to seek out those locations within the data where the information 
we seek is less disturbed by such errors [32, 33]. 

Good sources for more information concerning matched-field processing 
are the book by Tolstoy [155] and the papers [4], [18], [70], [91], [92], [139], 
[140], [141], [154], and [165]. 
25.1 Chapter Summary 


In many of the examples we have considered in this book, the data has 
been finitely many linear-functional values of the function of interest. In this 
chapter we consider this problem from a purely mathematical perspective. 
We take the function of interest to be a member of a Hilbert space, and 
use best approximation to solve the problem. 


25.2 The Basic Problem 


We want to reconstruct a function $f : \mathbb{R}^D \to \mathbb{C}$ from 
finitely many linear-functional values pertaining to that function. For 
example, we may want to reconstruct f from values $f(x_n)$ of f itself, or 
from Fourier-transform values $F(y_n)$. We adopt the view that f is a member 
of some infinite-dimensional Hilbert space H with inner product 
$\langle \cdot, \cdot \rangle$, and the data values are 

$$g_n = \langle f, h^n \rangle, \qquad (25.1)$$

for n = 1, ..., N, where the $h^n$ are known members of the Hilbert space. 
For example, suppose f(x) is supported on the interval [a, b] and we have 
Fourier-transform data, 

$$g_n = F(y_n) = \int_a^b f(x) e^{i x y_n}\, dx = \int_a^b f(x) \overline{e^n(x)}\, dx = \langle f, e^n \rangle,$$

where $e^n(x) = e^{-i x y_n}$. Because there are infinitely many solutions to 
our problem, we need some approach that singles out one solution. The most 
common approach is to select the estimate $\hat{f}$ of f that minimizes the 
norm $\|\hat{f}\| = \sqrt{\langle \hat{f}, \hat{f} \rangle}$, subject to 
$\hat{f}$ satisfying Equation (25.1); that is, 

$$g_n = \langle \hat{f}, h^n \rangle. \qquad (25.2)$$

We know that every element f of H can be written uniquely as 

$$f = \sum_{m=1}^{N} a_m h^m + u,$$

where $\langle u, h^n \rangle = 0$, for n = 1, ..., N. We may reasonably 
conclude from this that the probing or measuring of the function f that 
resulted in our data is incapable of telling us anything about u, so that 
we have no choice but to take the finite sum as our estimate of f. We then 
solve the system of linear equations 

$$g_n = \sum_{m=1}^{N} a_m \langle h^m, h^n \rangle, \quad n = 1, \ldots, N,$$

for the $a_m$. In the case of Fourier-transform data, this approach leads to 
the DFT estimator. This argument has been offered several times by 
researchers who should know better. There is a flaw in this argument that we 
can exploit to obtain better estimates of f. To illustrate the point, we 
consider the problem of reconstructing f(x) from Fourier-transform data. 


25.3 Fourier-Transform Data 


Suppose f(x) is zero outside the interval [a, b] and our data are the 
values $F(y_n)$, n = 1, ..., N, where F(y) is the Fourier transform of f(x). 
It is reasonable to suppose that f(x) is a member of the Hilbert space 
$L^2(a, b)$ and the inner product is 

$$\langle f, g \rangle = \int_a^b f(x) \overline{g(x)}\, dx. \qquad (25.3)$$

With $h^n(x) = e^n(x) = e^{-i x y_n}$, we have 

$$g_n = F(y_n) = \langle f, e^n \rangle.$$

But there are other inner products that we can use to represent the data. 
Suppose that p(x) is a bounded positive function on [a, b], bounded away 
from zero, with $w(x) = p(x)^{-1}$, and we define a new inner product on 
$L^2(a, b)$ by 

$$\langle f, g \rangle_w = \int_a^b f(x) \overline{g(x)} w(x)\, dx. \qquad (25.4)$$

Then we can represent the data as 

$$g_n = \int_a^b f(x) \overline{e^n(x)} p(x) w(x)\, dx = \langle f, t^n \rangle_w,$$

with 

$$t^n(x) = e^n(x) p(x).$$

Arguing just as in the previous section, we may claim that the only 
reasonable estimator of f(x) is in the span of the functions $t^n(x)$, since 
we know that f(x) can be written uniquely as 

$$f(x) = \sum_{m=1}^{N} b_m t^m(x) + v(x),$$

where 

$$\langle v, t^n \rangle_w = 0,$$

for n = 1, ..., N. The resulting estimator is 

$$\hat{f}(x) = p(x) \sum_{m=1}^{N} b_m e^{-i x y_m},$$

where the coefficients $b_m$ are found by forcing $\hat{f}(x)$ to be 
consistent with the inner product data; that is, the $b_m$ solve the system 
of linear equations 

$$g_n = \langle \hat{f}, t^n \rangle_w = \sum_{m=1}^{N} b_m \int_a^b p(x) e^{i x (y_n - y_m)}\, dx.$$

The point we are making here is that, even after we have decided which 
Hilbert space to use, $L^2(a, b)$ in this example, there will still be 
infinitely many inner products that can be chosen to represent the data, and 
therefore, infinitely many estimators of f(x), each one arguably the right 
choice. 
25.4 The General Case 


Let H be our chosen ambient Hilbert space, which contains f, with 
given inner product $\langle \cdot, \cdot \rangle$. Let $T : H \to H$ be a 
continuous, linear, invertible operator. The adjoint of T, with respect to 
the original inner product, is $T^\dagger$, defined by 

$$\langle Tf, g \rangle = \langle f, T^\dagger g \rangle.$$

Define the T-inner product to be 

$$\langle f, g \rangle_T = \langle T^{-1} f, T^{-1} g \rangle.$$

The adjoint of T, with respect to the T-inner product, is $T^*$, defined by 

$$\langle Tf, g \rangle_T = \langle f, T^* g \rangle_T.$$

Ex. 25.1 Prove that $T^* T = T T^\dagger$, so that 

$$T^\dagger = T^{-1} T^* T.$$

Then the data is 

$$g_n = \langle f, h^n \rangle = \langle Tf, Th^n \rangle_T = \langle f, T^* T h^n \rangle_T = \langle f, T T^\dagger h^n \rangle_T.$$

Now we consider the reconstruction problem within the Hilbert space 
endowed with the T-inner product. 

With this new inner product, the minimum-norm estimate of f is 

$$\hat{f} = \sum_{m=1}^{N} a_m T T^\dagger h^m,$$

with 

$$g_n = \langle \hat{f}, T T^\dagger h^n \rangle_T = \sum_{m=1}^{N} a_m \langle T T^\dagger h^m, T T^\dagger h^n \rangle_T,$$

or 

$$g_n = \sum_{m=1}^{N} a_m \langle T^\dagger h^m, T^\dagger h^n \rangle.$$

With G the Gram matrix with entries 

$$G_{mn} = \langle T^\dagger h^m, T^\dagger h^n \rangle,$$

we have to solve the system of linear equations 

$$g_n = \sum_{m=1}^{N} G_{mn} a_m,$$

for n = 1, ..., N. 
25.5 Some Examples 


In this section we illustrate the general case with two examples. 


25.5.1 Choosing the Inner Product 


If the function f(x) to be estimated is support-limited to the interval 
[a, b], it is reasonable to assume that f(x) is a member of $L^2(a, b)$, with 
the inner product given by Equation (25.3). In this case, the operator T is 
just the identity operator. The minimum-norm estimator associated with this 
usual inner product has the form 

$$\hat{f}(x) = \sum_{m=1}^{N} a_m h^m(x).$$

As we saw in the case of Fourier-transform data, there may be other inner 
products on $L^2(a, b)$ that lead to better estimates of f(x); in particular, 
the inner product given by Equation (25.4) permits us to incorporate prior 
information about the function |f(x)| in the estimate. The minimum-norm 
estimate associated with this inner product has the form 

$$\hat{f}(x) = p(x) \sum_{m=1}^{N} b_m h^m(x).$$

In this case, the linear operator T is defined by 

$$T(f)(x) = \sqrt{p(x)}\, f(x).$$

In both cases, the coefficients are determined by making the estimator 
consistent with the data; that is, by satisfying Equation (25.2). 
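
To make the two estimators concrete, here is a minimal numerical sketch. Everything specific in it is an assumption made for illustration: the interval [0, 1], the five frequencies $y_n = 2\pi n$, the box-shaped test function, the particular prior p(x), and the use of a midpoint rule in place of the integrals. It forms the Gram system for a given p(x), solves it, and returns $p(x) \sum_m b_m e^{-i x y_m}$; taking p(x) = 1 gives the estimator associated with the usual inner product.

```python
import cmath
import math

A, B, GRID = 0.0, 1.0, 400
XS = [A + (B - A) * (j + 0.5) / GRID for j in range(GRID)]   # midpoint quadrature nodes
DX = (B - A) / GRID

def integrate(values):
    """Midpoint-rule approximation of an integral over [A, B]."""
    return sum(values) * DX

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small complex system."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0j] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = acc / aug[r][r]
    return sol

def fourier_data(f_vals, ys):
    """g_n = integral of f(x) exp(i x y_n) over [A, B]."""
    return [integrate([f * cmath.exp(1j * x * y) for f, x in zip(f_vals, XS)]) for y in ys]

def min_norm_estimate(g, ys, p_vals):
    """Return p(x) * sum_m b_m exp(-i x y_m), with b solving the Gram system."""
    gram = [[integrate([p * cmath.exp(1j * x * (yn - ym)) for p, x in zip(p_vals, XS)])
             for ym in ys] for yn in ys]
    b = solve(gram, g)
    return [p * sum(bm * cmath.exp(-1j * x * ym) for bm, ym in zip(b, ys))
            for p, x in zip(p_vals, XS)]

# Assumed test problem: a box function observed through five Fourier-transform values.
f_true = [1.0 if 0.3 <= x <= 0.6 else 0.0 for x in XS]
ys = [2.0 * math.pi * n for n in range(-2, 3)]
g = fourier_data(f_true, ys)

flat_prior = [1.0] * GRID                                     # p(x) = 1: usual inner product
box_prior = [1.0 if 0.2 <= x <= 0.7 else 0.05 for x in XS]    # prior: f is mostly in [0.2, 0.7]
est_flat = min_norm_estimate(g, ys, flat_prior)
est_prior = min_norm_estimate(g, ys, box_prior)
print(max(abs(v) for v in est_flat), max(abs(v) for v in est_prior))
```

Both estimates reproduce the five data values; they differ only in which norm they minimize, which is exactly the freedom discussed in this section.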


25.5.2 Choosing the Hilbert Space 


We even have a choice to make in the selection of the Hilbert space 
itself. Suppose we know that f(x) is really zero outside the smaller interval 
$[c, d] \subset [a, b]$. We can select as H the space $L^2[c, d]$, or perhaps 
the closed subspace of all members of $L^2[a, b]$ that are zero outside 
[c, d]. If we take 
the view that, once we have changed the inner product we have already 
changed the Hilbert space, then there are still more Hilbert spaces we may 
use. 
25.6 Summary 


The flaw in the original argument presented in the first section is that 
it assumes that the function f(x) is a member of only one Hilbert space, 
with only one inner product and norm to be dealt with, and that the 
linear-functional data must be represented using this single inner product. 
The minimum-norm solution is determined, once we settle on a particular 
Hilbert space and inner product, but we have a great deal of choice in se- 
lecting these. This is the stage at which we can incorporate prior knowledge 
to improve our estimator of f(x). 
26.1 Chapter Summary 


In this appendix we survey, without proofs, some of the basic theorems 
concerning Fourier series and Fourier transforms. The discussion here is 
taken largely from [134] and [51]. There are many books, such as [80], that 
the reader interested in further details may consult. The book [101] is a 
delightful, if unconventional, journey through the theory and applications 
of Fourier analysis. 


26.2 Fourier Series 

Let $f : [-L, L] \to \mathbb{C}$. The Fourier series associated with the 
function f is 

$$f(x) \approx \sum_{n=-\infty}^{\infty} c_n e^{i n \pi x / L},$$

with 

$$c_n = \frac{1}{2L} \int_{-L}^{L} f(x) e^{-i n \pi x / L}\, dx.$$

The Nth partial sum is defined to be 

$$S_N(x) = \sum_{n=-N}^{N} c_n e^{i n \pi x / L}.$$

Convergence of the Fourier series involves the behavior of the sequence 
$\{S_N(x)\}$ as $N \to \infty$. 

It is known that, even if f can be extended to a 2L-periodic function that 
is everywhere continuous, there can be values of x, even a non-denumerable 
and everywhere dense set of x, at which the Fourier series fails to converge. 
However, it was shown by Carleson in 1966 that, under these conditions on 
f, the series will converge to f almost everywhere; that is, except on a set 
of Lebesgue measure zero. 

We can’t expect $S_N(x)$ to converge to f(x) for all x, since, if f(x) and 
g(x) differ at only finitely many points, they have the same associated 
Fourier series. If both f and g are continuous and 2L-periodic, and the 
Fourier coefficients are the same, must f = g? The answer is yes, because 
of Fejér’s Theorem. 

Instead of considering $S_N(x)$, we consider 

$$\sigma_N(x) = \frac{1}{N+1} \left( S_0(x) + S_1(x) + \cdots + S_N(x) \right).$$

We have the following theorem. 


Theorem 26.1 (Fejér’s Theorem) Let f have a continuous 2L-periodic 
extension. Then the sequence $\{\sigma_N(x)\}$ converges to f(x) uniformly. 


Corollary 26.1 If f and g both have continuous 2L-periodic extensions 
and their Fourier coefficients agree, then f = g. 
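
The averaged sums can be computed directly from their definition. The sketch below is an illustration only; it assumes the particular function f(x) = |x| on $[-\pi, \pi]$ (so $L = \pi$), whose coefficients under the normalization $c_n = \frac{1}{2L}\int_{-L}^{L} f(x) e^{-i n \pi x/L} dx$ are $c_0 = \pi/2$ and $c_n = ((-1)^n - 1)/(\pi n^2)$ for $n \neq 0$, and it measures the largest error on a finite grid of sample points.

```python
import cmath
import math

def coeff(n):
    """Fourier coefficients of f(x) = |x| on [-pi, pi] (so L = pi)."""
    if n == 0:
        return math.pi / 2.0
    return ((-1) ** n - 1) / (math.pi * n * n)

def partial_sum(x, N):
    """S_N(x): sum of c_n exp(i n x) for n from -N to N (real for this f)."""
    return sum(coeff(n) * cmath.exp(1j * n * x) for n in range(-N, N + 1)).real

def fejer_mean(x, N):
    """sigma_N(x) = (S_0(x) + S_1(x) + ... + S_N(x)) / (N + 1)."""
    return sum(partial_sum(x, k) for k in range(N + 1)) / (N + 1)

grid = [-math.pi + 2.0 * math.pi * j / 80 for j in range(81)]

def max_error(approx, N):
    """Largest deviation from |x| over the sample grid."""
    return max(abs(approx(x, N) - abs(x)) for x in grid)

for N in (8, 64):
    print(N, max_error(partial_sum, N), max_error(fejer_mean, N))
```

For this particular f both sequences converge uniformly, since f is Lipschitz. What Fejér's Theorem adds is that the averaged sums converge uniformly for every continuous 2L-periodic f, including those whose partial sums fail to converge at some points.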


Theorem 26.2 If f has a continuous 2L-periodic extension, then 

$$\lim_{N \to \infty} \int_{-L}^{L} |f(x) - S_N(x)|^2\, dx = 0,$$

and 

$$\frac{1}{2L} \int_{-L}^{L} |f(x)|^2\, dx = \sum_{n=-\infty}^{\infty} |c_n|^2.$$


Definition 26.1 The function f is said to be Lipschitz continuous, or just 
Lipschitz, at x if there are constants M > 0 and $\delta > 0$ such that 
$|x - y| < \delta$ implies $|f(x) - f(y)| \le M |x - y|$. 

Theorem 26.3 If f is Lipschitz at x then $S_N(x) \to f(x)$. 

Corollary 26.2 If f is differentiable at x, then $S_N(x) \to f(x)$. 


Proof: Since f is differentiable at x it is also Lipschitz at x. 
26.3 Fourier Transforms 


In previous chapters it was our practice to treat the basic formulas for 
a Fourier-transform pair, 

$$F(y) = \int_{-\infty}^{\infty} f(x) e^{i x y}\, dx, \qquad (26.1)$$

and 

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(y) e^{-i x y}\, dy, \qquad (26.2)$$

as formal expressions, rather than as universally valid statements. 
Theorems concerning the validity of these expressions must always include 
assumptions about the properties of f and F, and about the nature of the 
integrals involved. 

In the theory of Riemann integration the two symbols 

$$\int_a^{+\infty} f(x)\, dx \qquad (26.3)$$

and 

$$\lim_{b \to +\infty} \int_a^b f(x)\, dx \qquad (26.4)$$

are equivalent; in the theory of Lebesgue integration they are different. In 
the Lebesgue theory, the integral in Equation (26.3) involves two 
approximations done simultaneously; we approximate the function f by a 
sequence of step functions, while at the same time extending the domain of 
the step functions to infinity. In Equation (26.4) the two limiting processes 
are done sequentially; first approximate f by step functions on [a, b] to get 
the integral, and then take the limit, as b approaches infinity. For example, 
the function $f(x) = \frac{\sin x}{x}$ on $[0, +\infty)$ is not Lebesgue 
integrable, since its positive and negative parts are not separately Lebesgue 
integrable, but the Riemann integral is 

$$\int_0^{+\infty} \frac{\sin x}{x}\, dx = \frac{\pi}{2},$$

which can be shown using the theory of residues. 


Definition 26.2 Let $1 \le p < +\infty$. A function 
$f : \mathbb{R} \to \mathbb{C}$ is said to be in the class $L^p$ if f is 
measurable in the sense of Lebesgue and the Lebesgue integral 

$$\int_{-\infty}^{\infty} |f(x)|^p\, dx$$

is finite. Functions f in $L^1$ are said to be absolutely integrable; 
functions f in $L^2$ are square integrable. 


If f is in $L^1$, then the integral in Equation (26.1) exists for all y and 
defines a bounded, continuous function on the whole of $\mathbb{R}$. If, in 
addition, the function F is in $L^1$, then the integral in Equation (26.2) 
also exists for all x and defines a bounded, continuous function that is 
equal, almost everywhere, to the original f. In general, however, F need not 
be a member of $L^1$, and more complicated efforts are needed to give meaning 
to Equation (26.2). 

If f is in $L^2$, then the limit 

$$F(y) = \lim_{A \to +\infty} \int_{-A}^{A} f(x) e^{i x y}\, dx$$

exists, in the $L^2$ sense, and defines the Fourier transform of f as a 
member of $L^2$. In addition, the limit 

$$f(x) = \lim_{A \to +\infty} \frac{1}{2\pi} \int_{-A}^{A} F(y) e^{-i x y}\, dy$$

also exists, in the $L^2$ sense, and provides the inversion formula. 

In order for the spaces $L^1$ and $L^2$ to be complete as metric spaces, the 
members of $L^1$ and $L^2$ are not individual functions, but equivalence 
classes of functions. Two functions f and g are equivalent if the function 
f - g is equal to zero, except possibly on a set of measure zero. However, 
we shall continue to speak of the members of these spaces as functions. 


26.4 Functions in the Schwartz Class 


As we just discussed, the integrals in Equations (26.1) and (26.2) may 
have to be interpreted carefully if they are to be applied to fairly general 
classes of functions f(x) and F(y). In this section we describe a class of 
functions for which these integrals can be defined. 

If both f(x) and F(y) are measurable and absolutely integrable then 
both functions are continuous. To illustrate some of the issues involved, we 
consider the functions in the Schwartz class [80]. 

A function f(x) is said to be in the Schwartz class, or to be a Schwartz 
function, if f(x) is infinitely differentiable and 

$$|x|^m f^{(n)}(x) \to 0,$$

as $|x| \to +\infty$, for all nonnegative integers m and n. Here 
$f^{(n)}(x)$ denotes the nth derivative of f(x). An example of a Schwartz 
function is $f(x) = e^{-x^2}$, with Fourier transform 
$F(y) = \sqrt{\pi}\, e^{-y^2/4}$. The following proposition tells us that 
Schwartz functions are absolutely integrable on the real line, and so the 
Fourier transform is well defined. 


Proposition 26.1 If f(x) is a Schwartz function, then 

$$\int_{-\infty}^{\infty} |f(x)|\, dx < +\infty.$$


Proof: There is a constant M > 0 such that $|x|^2 |f(x)| \le 1$, for 
|x| > M. Then 

$$\int_{-\infty}^{\infty} |f(x)|\, dx \le \int_{-M}^{M} |f(x)|\, dx + \int_{|x| > M} |x|^{-2}\, dx < +\infty.$$


If f(x) is a Schwartz function, then so is its Fourier transform. To prove 
the Fourier Inversion Formula it is sufficient to show that 

$$f(0) = \int_{-\infty}^{\infty} F(y)\, dy / 2\pi.$$

Write 

$$f(x) = f(0) e^{-x^2} + \left( f(x) - f(0) e^{-x^2} \right) = f(0) e^{-x^2} + g(x). \qquad (26.5)$$

Then g(0) = 0, so g(x) = x h(x), where h(x) = g(x)/x is also a Schwartz 
function. Then the Fourier transform of g(x) is, up to the constant factor 
-i, the derivative of the Fourier transform of h(x); that is, 

$$G(y) = -i H'(y).$$

The function H(y) is a Schwartz function, so it goes to zero at the 
infinities. Computing the Fourier transform of both sides of Equation (26.5), 
we obtain 

$$F(y) = f(0) \sqrt{\pi}\, e^{-y^2/4} - i H'(y).$$

Therefore, 

$$\int_{-\infty}^{\infty} F(y)\, dy = 2\pi f(0) - i \left( H(+\infty) - H(-\infty) \right) = 2\pi f(0).$$

To complete the proof of the Fourier Inversion Formula, we let 
$K(y) = F(y) e^{-i x_0 y}$, for fixed $x_0$. Then the inverse Fourier 
transform of K(y) is $k(x) = f(x + x_0)$, and therefore 

$$\int_{-\infty}^{\infty} K(y)\, dy = 2\pi k(0) = 2\pi f(x_0). \qquad (26.6)$$
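
The Gaussian pair used in this argument is easy to check numerically. The sketch below is an illustration only; the truncation limits and step counts of the trapezoid rule are assumptions chosen so that the neglected tails are negligible. It verifies that $\sqrt{\pi}\, e^{-y^2/4}$ is the transform of $e^{-x^2}$ under the convention of Equation (26.1), and that its integral is $2\pi f(0) = 2\pi$.

```python
import math

def trapezoid(func, lo, hi, n):
    """Composite trapezoid rule with n subintervals on [lo, hi]."""
    h = (hi - lo) / n
    return h * (0.5 * func(lo) + sum(func(lo + j * h) for j in range(1, n)) + 0.5 * func(hi))

def F(y):
    """Closed form of the Fourier transform of f(x) = exp(-x^2)."""
    return math.sqrt(math.pi) * math.exp(-y * y / 4.0)

def F_numeric(y):
    """Integral of exp(-x^2) exp(i x y) dx; the imaginary part vanishes by symmetry."""
    return trapezoid(lambda x: math.exp(-x * x) * math.cos(x * y), -10.0, 10.0, 4000)

total = trapezoid(F, -40.0, 40.0, 8000)     # integral of F(y) dy
print(F_numeric(1.5), F(1.5))
print(total, 2.0 * math.pi)
```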
26.5 Generalized Fourier Series 


Let H be a Hilbert space, with inner product $\langle \cdot, \cdot \rangle$, 
and $\{\phi^1, \phi^2, \ldots\}$ an orthonormal basis for H. Let f be a 
member of H. Then there are unique coefficients $c_1, c_2, \ldots$ such that 
the generalized Fourier series converges to f; that is, 

$$f(x) = \sum_{n=1}^{\infty} c_n \phi^n(x).$$

The coefficients are given by 

$$c_n = \langle f, \phi^n \rangle.$$

Let the Nth partial sum of the series be 

$$S_N(x) = \sum_{n=1}^{N} c_n \phi^n(x).$$

Then when we say that the series converges to f we mean that 

$$\lim_{N \to \infty} \|f - S_N\| = 0.$$

The following exercise shows that the Nth partial sum is also a best 
approximation of f. 


Ex. 26.1 Let 

$$T_N(x) = \sum_{n=1}^{N} b_n \phi^n(x),$$

for an arbitrary selection of the coefficients $b_n$. Show that 

$$\|f - S_N\| \le \|f - T_N\|,$$

with equality if and only if $b_n = c_n$ for n = 1, ..., N. 


26.6 Wiener Theory 


The study of periodic components of functions is one of the main topics 
in generalized harmonic analysis [163]. To analyze such functions Norbert 
Wiener focused on the autocorrelation function of f, defined by 


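$$r_f(\tau) = \lim_{T \to +\infty} \frac{1}{2T} \int_{-T}^{T} f(t) \overline{f(t - \tau)}\, dt.$$

For example, let 

$$f(t) = \sum_{n=1}^{N} a_n e^{-i \omega_n t},$$

with distinct real frequencies $\omega_n$. Then we have 

$$r_f(\tau) = \sum_{n=1}^{N} |a_n|^2 e^{-i \omega_n \tau},$$

and 

$$|a_n|^2 = \lim_{T \to +\infty} \frac{1}{2T} \int_{-T}^{T} r_f(\tau) e^{i \omega_n \tau}\, d\tau.$$

Notice that the Fourier transform of $r_f(\tau)$ is 

$$R_f(\omega) = 2\pi \sum_{n=1}^{N} |a_n|^2 \delta(\omega - \omega_n),$$

the power spectrum of the function f. In order to avoid involving delta 
functions, Wiener takes a different approach to analyzing the spectrum of 
f. 

In general, whenever the function $r_f(\tau)$ exists, the integrated 
spectrum of f is the function 

$$S(\omega) = \int_{-\infty}^{\infty} r_f(\tau) \frac{e^{i \omega \tau} - 1}{i \tau}\, d\tau.$$

Let’s try to make sense of this definition. 
Let $G(y) = \chi_{[0,\omega]}(y)$ be the characteristic function of the 
interval $[0, \omega]$. Then the inverse Fourier transform of G(y) is 

$$g(x) = \frac{1}{2\pi} \int_0^{\omega} e^{-i x y}\, dy = \frac{1}{2\pi} \frac{e^{-i x \omega} - 1}{-i x}.$$

When the Parseval-Plancherel Equation (2.9) holds, we have 

$$S(\omega) = 2\pi \int_{-\infty}^{\infty} r_f(\tau) \overline{g(\tau)}\, d\tau = \int_0^{\omega} R_f(y)\, dy,$$

so that $S'(\omega) = R_f(\omega)$. In such cases, $S(\omega)$ is 
differentiable, $S'(\omega) = R_f(\omega)$ is nonnegative, and $S'(\omega)$ is 
the power spectrum or spectral density function of f. When the function f 
contains periodic components, the function $S(\omega)$ will have 
discontinuities, which is why Wiener focuses on $S(\omega)$, rather than 
on $S'(\omega)$. 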
27.1 Chapter Summary 


A nice application of Dirac delta-function models is the problem of 
reverberation and echo cancellation, as discussed in [116]. The received 
signal is viewed as a filtered version of the original and we want to remove 
the effects of the filter, thereby removing the echo. This leads to the problem 
of finding the inverse filter. A version of the echo cancellation problem arises 
in telecommunications, as discussed in [147] and [148]. 


27.2 The Echo Model 


Suppose that x(t) is the original transmitted signal and the received 
signal is 

$$y(t) = x(t) + a x(t - d),$$

where d > 0 is the delay present in the echo term. We assume that the echo 
term is weaker than the original signal, so we make 0 < a < 1. With the 
filter function h(t) defined by 

$$h(t) = \delta(t) + a \delta(t - d) = \delta(t) + a \delta_d(t), \qquad (27.1)$$

where $\delta_d(t) = \delta(t - d)$, we can write y(t) as the convolution of 
x(t) and h(t); that is, 

$$y(t) = x(t) * h(t).$$
A more general model is used to describe reverberation: 

$$h(t) = \sum_{k=0}^{K} a_k \delta(t - d_k),$$

with $a_0 = 1$, $d_0 = 0$, and $d_k > 0$ and $0 < a_k < 1$ for 
k = 1, 2, ..., K. 

Our goal is to find a second filter, denoted $h_i(t)$, the inverse of h(t) 
in Equation (27.1), such that 

$$h(t) * h_i(t) = \delta(t),$$

and therefore 

$$x(t) = y(t) * h_i(t). \qquad (27.2)$$

For now, we use trial and error to find $h_i(t)$; later we shall use the 
Fourier transform. 


27.3 Finding the Inverse Filter 

As a first guess, let us try 

$$g_1(t) = \delta(t) - a \delta_d(t).$$

Convolving $g_1(t)$ with h(t), we get 

$$h(t) * g_1(t) = \delta(t) * \delta(t) - a^2 \delta_d(t) * \delta_d(t).$$

We need to find out what $\delta_d(t) * \delta_d(t)$ is. 


Ex. 27.1 Use the sifting property of the Dirac delta and the definition of 
convolution to show that 

$$\delta_d(t) * \delta_d(t) = \delta_{2d}(t).$$


The Fourier transform of $\delta_d(t)$ is the function $\exp(i d \omega)$, so 
that the Fourier transform of the convolution of $\delta_d(t)$ with itself is 
the square of $\exp(i d \omega)$, or $\exp(i(2d)\omega)$. This tells us again 
that the convolution of $\delta_d(t)$ with itself is $\delta_{2d}(t)$. 
Therefore, 

$$h(t) * g_1(t) = \delta(t) - a^2 \delta_{2d}(t).$$

We do not quite have what we want, but since 0 < a < 1, the $a^2$ is 
smaller than a. 
Suppose that we continue down this path, and take for our next guess 
the filter function $g_2(t)$ given by 

$$g_2(t) = \delta(t) - a \delta_d(t) + a^2 \delta_{2d}(t).$$

We then find that 

$$h(t) * g_2(t) = \delta(t) + a^3 \delta_{3d}(t);$$

the coefficient is $a^3$ now, which is even smaller, and the delay in the 
echo term has moved to 3d. We could continue along this path, but a final 
solution is beginning to suggest itself. 

Suppose that we define 

$$g_N(t) = \sum_{n=0}^{N} (-1)^n a^n \delta_{nd}(t).$$

It would then follow that 

$$h(t) * g_N(t) = \delta(t) - (-1)^{N+1} a^{N+1} \delta_{(N+1)d}(t).$$

The coefficient $a^{N+1}$ goes to zero and the delay goes to infinity, as 
$N \to \infty$. This suggests that the inverse filter should be the infinite 
sum 

$$h_i(t) = \sum_{n=0}^{\infty} (-1)^n a^n \delta_{nd}(t). \qquad (27.3)$$

Then Equation (27.2) becomes 

$$x(t) = y(t) - a y(t - d) + a^2 y(t - 2d) - a^3 y(t - 3d) + \cdots.$$

Obviously, to remove the echo completely in this manner we need infinite 
memory. 
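
A discrete-time sketch shows how the truncated filters $g_N$ behave. The sampled signal, the echo strength a = 0.5, and the delay of three samples are all assumptions made for the illustration, and the signal is taken to be zero before time zero. Applying $g_N$ to y leaves a residual echo of size $a^{N+1}$ delayed by (N + 1)d samples.

```python
def add_echo(x, a, delay):
    """y[t] = x[t] + a * x[t - delay], with x taken to be zero before t = 0."""
    return [x[t] + (a * x[t - delay] if t >= delay else 0.0) for t in range(len(x))]

def truncated_inverse(y, a, delay, N):
    """Apply g_N: the sum over n = 0..N of (-a)^n * y[t - n*delay]."""
    out = []
    for t in range(len(y)):
        total = 0.0
        for n in range(N + 1):
            if t - n * delay < 0:        # y is zero before t = 0
                break
            total += (-a) ** n * y[t - n * delay]
        out.append(total)
    return out

a, delay = 0.5, 3                        # assumed echo strength and delay (in samples)
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0, 0.25, -0.75, 1.0, 4.0]
y = add_echo(x, a, delay)

for N in (1, 2, 3):
    x_hat = truncated_inverse(y, a, delay, N)
    print(N, max(abs(u - v) for u, v in zip(x, x_hat)))
```

Because this x starts at time zero and has only twelve samples, the residual term for N = 3 would refer to samples before the start of the signal, so the recovery is exact on the record.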


Ex. 27.2 Assume that x(t) = 0 for t < 0. Show that the problem of 
removing the echo is simpler now. 


27.4 Using the Fourier Transform 


The Fourier transform of the filter function h(t) in Equation (27.1) is 

$$H(\omega) = 1 + a \exp(i d \omega).$$

If we are to have 

$$h(t) * h_i(t) = \delta(t),$$

we must have 

$$H(\omega) H_i(\omega) = 1,$$

where $H_i(\omega)$ is the Fourier transform of the inverse filter function 
$h_i(t)$ that we seek. It follows that 

$$H_i(\omega) = (1 + a \exp(i d \omega))^{-1}.$$

Recalling the formula for the sum of a geometric progression, 

$$1 - r + r^2 - r^3 + \cdots = \frac{1}{1 + r},$$

for |r| < 1, and taking $r = a \exp(i d \omega)$, so that |r| = a < 1, we 
find that we can write 

$$H_i(\omega) = 1 - a \exp(i d \omega) + a^2 \exp(i(2d)\omega) - a^3 \exp(i(3d)\omega) + \cdots,$$

which tells us that $h_i(t)$ is precisely as given in Equation (27.3). 


27.5 The Teleconferencing Problem 


In teleconferencing, each separate room is equipped with microphones 
for transmitting to the other rooms and loudspeakers for broadcasting what 
the people in the other rooms are saying. For simplicity, consider two rooms, 
the transmitting room (TR), in which people are currently speaking, and 
the receiving room (RR), where the people are currently listening to the 
broadcast from the TR. The RR also has microphones and the problem 
arises when the signal broadcast into the RR from the TR reaches the 
microphones in the RR and is broadcast back into the TR. If it reaches 
the microphones in the TR, it will be re-broadcast to the RR, creating an 
echo, or worse. 

The signal that reaches a microphone in the RR will depend on the 
signals broadcast into the RR from the TR, as well as on the acoustics of 
the RR and on the placement of the microphone in the RR; that is, it will 
be a filtered version of what is broadcast into the RR. The hope is to be 
able to estimate the filter, generate an approximation of what is about to be 
re-broadcast, and subtract the estimate prior to re-broadcasting, thereby 
reducing to near zero what is re-broadcast back to the TR. 

In practice, all signals are viewed as discrete time series, and all filters 
are taken to be finite impulse response (FIR) filters. Because the acoustics 
of the RR are not known a priori, the filter that the RR imposes must 
be estimated. This is done adaptively, by comparing vectors of samples 
of the original transmissions with the filtered version that is about to be 
re-broadcast, as described in [148]. 
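
One common way to carry out such an adaptive estimate is the normalized least-mean-squares (NLMS) update. The sketch below uses NLMS as a stand-in; the text does not say which adaptive algorithm [148] uses, so the choice of NLMS, the step size, the four-tap room response, and the white-noise far-end signal are all assumptions made for illustration.

```python
import random

def nlms_echo_canceller(far_end, mic, taps, step=0.5, eps=1e-8):
    """Adaptively estimate an FIR echo path; return (weights, residual signal)."""
    w = [0.0] * taps
    residual = []
    for t in range(len(far_end)):
        # The most recent `taps` far-end samples, newest first (zeros before t = 0).
        u = [far_end[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * uk for wk, uk in zip(w, u))
        e = mic[t] - echo_estimate           # what would be re-broadcast to the TR
        power = sum(uk * uk for uk in u) + eps
        w = [wk + step * e * uk / power for wk, uk in zip(w, u)]
        residual.append(e)
    return w, residual

rng = random.Random(0)
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]     # signal broadcast into the RR
room = [0.6, -0.3, 0.2, 0.1]                             # hypothetical RR impulse response
mic = [sum(room[k] * far_end[t - k] for k in range(len(room)) if t - k >= 0)
       for t in range(len(far_end))]                     # echo picked up by the RR microphone

weights, residual = nlms_echo_canceller(far_end, mic, taps=len(room))
print(weights)
print(sum(e * e for e in residual[-100:]))
```

In this noise-free setting the estimated weights approach the assumed room response and the residual shrinks toward zero. In a real room the microphone also picks up local speech and noise, so the residual would level off rather than vanish.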